Posts Tagged AWS

Top reasons Amazon EC2 instances disappear

(Judging by the posting gap, the author of this blog almost disappeared too! Time to lift the head from the day-to-day scramble and write the next entry!)

The fact that Amazon says up-front that computers fail seems to be the number one concern and criticism of EC2, especially from people who have not used it extensively. I don’t actually spend much time thinking about that because in our experience it’s not something to worry about. It is essential to take into account when designing a system, though: whenever we set something up on a machine we immediately think “and what do we do when it fails?” That’s a good thing, not a bad thing, as anyone with production datacenter experience can attest.

Since it’s such a hot topic, I’ve been keeping a close eye on all the “my instance disappeared” threads on the EC2 forum, and it’s not easy to sort them out. I have no doubt that the vast majority has to do with operator error:

  • trying SSH and forgetting to open port 22 in the security group (or similarly with other ports)
  • having difficulties with the SSH keys, or forgetting to set-up a key to begin with
  • using/constructing an AMI that does not have SSH properly set-up
  • using/constructing an AMI that does not boot properly (network and/or sshd issues) and failing to look into console output
  • instance reboot failing, for example disk mounts failing due to mount point changes that were not reflected into init scripts
  • sshd killed by kernel out-of-memory reaper, failing to look into console output for diagnosis
  • … and many more

Some of these are beginners failing to read the getting started guide, some are more subtle and can happen even to veteran EC2 users. Then there are emails from Amazon saying “we have noticed that one or more of your instances are running on a host degraded due to hardware failure” and I wonder how many users don’t get these emails because their AWS account’s email address points into a bit bucket.

No doubt there are real failures as well where a host dies and takes the instances with it, or one of the disks used by an instance gives up which is the end of that instance. The question here is how frequent this is relative to the total number of instances running, and since Amazon is so secretive with their numbers it’s really difficult to make even an educated guess. I tried to go back into our year of logs to see whether I could estimate the failure rate, but I don’t have enough data to distinguish failure from shutdown, sigh.

The failures that concern me the most are actually not instance failures but network failures. Anyone who has set up a large datacenter will know that network issues are the most difficult to get under control. The damn network just keeps changing, and as soon as you try to hold still your service providers change stuff. Some of the instance disappearances are really network issues that cause an instance to be unreachable, or unreachable from certain other instances. These are hard to troubleshoot and on more than one occasion I’ve had to run tcpdump on both ends to see packets departing and never arriving. If I can get to the target instance at all to run tcpdump, that is… I hope Amazon gets a better handle on this type of failure and provides us with better troubleshooting tools. In the meantime, it’s important to flag issues to them so they can troubleshoot and eliminate the root causes.

The really good news is that the Amazon folks are very dedicated to figuring out what’s going wrong and fixing it. So if you have an issue, be sure to do the troubleshooting you can, then set the instance aside, launch a new one to take its place, and post all the details on the forum. Shut the instance down only after the issue has been looked at. Looking back, there were two big issues causing instance termination. One was the day when some EC2 front-end explicitly terminated a bunch of instances in error. Not good, but from what we saw it wasn’t a massive failure either. They clearly have done their best to ensure this doesn’t recur. The other was an instance reboot bug which caused many instances to die in the reboot process. We learned not to reboot ailing instances but instead to relaunch and rescue any data. This issue also seems to be fixed at this point.

To summarize, if you can’t reach an instance, here is what you should do:

  • try to SSH, check the security group
  • distinguish SSH timeout from key issues (timeout vs. permission denied type of errors)
  • use ping to test connectivity (enable ICMP if you have the bad habit of disabling it)
  • check the console output (use the convenient button on the RightScale dashboard), note that it can take a few minutes for stuff to appear
  • look at the RightScale monitoring to see whether the instance is still sending monitoring data
  • hop onto an instance in the same security group and try connecting from there (launch an instance if you don’t have any)
  • post details (instance ID, what you’ve tried, symptoms observed) on the forum and set instance aside
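The "timeout vs. refused" distinction in the checklist above can be checked at the TCP level before SSH keys even come into play. The following is a minimal sketch, not a RightScale or Amazon tool: the function name and the result labels are made up for illustration. A "timeout" usually points at dropped packets (security group, network trouble, or a dead instance), "refused" means the instance answers but nothing listens on the port (for example sshd was killed), and "open" means a remaining SSH failure is most likely a key or login problem.

```python
import socket


def classify_tcp_port(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt.

    Returns 'open', 'refused', 'timeout', or 'unreachable'.
    The labels are illustrative, not part of any EC2 API.
    """
    try:
        # create_connection performs the TCP handshake only; no SSH is spoken.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with a reset: it is up, but nothing is listening.
        return "refused"
    except socket.timeout:
        # No reply at all within the timeout: packets are being dropped.
        return "timeout"
    except OSError:
        # Anything else, such as DNS failure or no route to host.
        return "unreachable"


if __name__ == "__main__":
    # Hypothetical host name; substitute the public DNS name of your instance.
    print(classify_tcp_port("ec2-host.example.com", 22, timeout=3.0))
```

Running the same check from another instance in the same security group, as suggested above, helps separate an instance problem from a network path problem.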

All in all, the number one lesson is “relaunch!” There are thousands of instances waiting to be utilized so use a fresh one if you see trouble with an existing one. If you master this step you can use it in so many situations: to scale up, to scale down, to handle instance failure, to handle software failure, to enable test set-ups, etc. If you use RightScale you will notice that that’s also what we focus on: making it easier to launch fresh instances.

Deploying many Rails sites onto Amazon EC2

One of our customers is deploying many Rails sites onto EC2, more precisely, many instances of virtually the same site. Basically they have a Rails application and they tweak it for each individual site they set-up. EC2 is a wonderful deployment platform for this type of business because there is very little friction in adding customers since it takes just one button press to get more servers.

The overall architecture concept we’re using for this customer is to build a number of app+database clusters and to load multiple sites onto each one. The number of sites per cluster can be adjusted such that the database portion of each cluster is loaded up optimally, and it’s designed such that sites can be moved around easily, for example to offload a cluster that may have become too heavily loaded as some of the sites on it have grown.

In the end, the architecture boils down to having two instances running a mysql master/slave set-up managed by our Manager for MySQL plus two instances running load balancers and Rails/Mongrel as redundant app servers. This makes for a fully redundant cluster on which a number of sites can be hosted. It is also easy to add a few more EC2 instances running Rails depending on the Rails vs. MySQL workload balance.

Each site on a cluster has its own logical database (i.e. what MySQL calls a “database”), this makes it easy to backup and restore a site individually, and most importantly, to move a site to another cluster in order to free up resources on the original one. The sites on a cluster can also share the app servers as long as there is no HTTPS involved. The reason for this caveat is that each Amazon EC2 instance has only a single IP address and it is not possible to do “virtual hosting” with HTTPS sites. With HTTP all the www.site1.com, www.site2.com, etc. DNS entries point to the same two load balancing instances (using what’s called “round-robin DNS” for fail-over purposes) and the load balancer (or front-end Apache, if used) figures out which site the user is visiting based on the “host” header included in every HTTP 1.1 request.
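To make the name-based virtual hosting idea concrete, here is a small sketch of how a front-end could map the Host header to a site. It is only an illustration under assumed names: the `SITES` table and the directory paths are hypothetical, and a real deployment would express this in the load balancer or Apache configuration instead. With HTTPS the header only becomes readable after the TLS handshake has completed, which is the caveat described above.

```python
from typing import Dict, Optional

# Hypothetical mapping from host name to the site's application directory.
SITES: Dict[str, str] = {
    "www.site1.com": "/srv/sites/site1",
    "www.site2.com": "/srv/sites/site2",
}


def site_root_for(headers: Dict[str, str]) -> Optional[str]:
    """Return the application directory for a request, based on its Host header."""
    host = ""
    for name, value in headers.items():
        # Header names are case-insensitive in HTTP.
        if name.lower() == "host":
            host = value
            break
    # Strip an optional ":port" suffix and normalise the case.
    hostname = host.split(":", 1)[0].strip().lower()
    return SITES.get(hostname)


if __name__ == "__main__":
    print(site_root_for({"Host": "www.site2.com:80"}))
```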

What’s really nice about these 4-6 machine clusters is that they’re very powerful yet so simple. There’s no “infinitely scalable” magic under the hood that breaks at the worst moments. No, it’s a plain set-up that anyone with a bit of experience can fully understand. The magic is that it’s so easy to set these clusters up with Amazon EC2 plus RightScale so you can really take advantage of the same “horizontal scaling” as the big guys (Google, Yahoo!, etc.).

One of the interesting design decisions with all this is how to set-up DNS. For example, the app for site1 needs to locate the IP address of the database it’s supposed to talk to. We use DNS as follows:

  • the app connects to db-master.site1.com
  • db-master.site1.com resolves to a CNAME for db-master-cluster3.company.com
  • db-master-cluster3.company.com resolves to the IP address of the instance that currently hosts the master
  • DNS for db-master-cluster3.company.com is set-up with a low TTL (we use 75 secs) and supports dynamic updates
  • if the DB master crashes or is otherwise replaced the db-master-cluster3.company.com DNS entry is automatically updated by the RightScale MySQL manager, which switches all the sites hosted by that cluster over with one stroke
  • if site1 is moved to a different cluster, then the CNAME has to be updated to point to the correct cluster DB
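The two-level lookup described in the list above can be modelled with a plain dictionary. This is a toy resolver for illustration only, not real DNS code: the IP addresses, the `site2` entry and the `cluster4` name are invented. It shows why one record update on fail-over switches every site on the cluster, while moving a site touches only that site's CNAME.

```python
from typing import Dict, Tuple

Record = Tuple[str, str]  # (record type, value)


def resolve(name: str, records: Dict[str, Record], max_hops: int = 8) -> str:
    """Follow CNAME records until an A record is found and return its address."""
    for _ in range(max_hops):
        record_type, value = records[name]
        if record_type == "A":
            return value
        name = value  # CNAME: continue with the target name
    raise RuntimeError("CNAME chain too long")


# Hypothetical zone contents (addresses are made up).
records: Dict[str, Record] = {
    "db-master.site1.com": ("CNAME", "db-master-cluster3.company.com"),
    "db-master.site2.com": ("CNAME", "db-master-cluster3.company.com"),
    "db-master-cluster3.company.com": ("A", "10.0.0.11"),
    "db-master-cluster4.company.com": ("A", "10.0.0.21"),
}

if __name__ == "__main__":
    # Fail-over: a single update to the cluster record moves all hosted sites.
    records["db-master-cluster3.company.com"] = ("A", "10.0.0.12")
    print(resolve("db-master.site1.com", records))
    print(resolve("db-master.site2.com", records))

    # Moving site1 to another cluster only changes site1's CNAME.
    records["db-master.site1.com"] = ("CNAME", "db-master-cluster4.company.com")
    print(resolve("db-master.site1.com", records))
```

The low TTL matters because clients cache the cluster record; with 75 seconds, a fail-over is picked up after at most roughly that long.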

For the web sites themselves it’s also nice to use CNAMEs:

  • www.site1.com points to www-cluster3.company.com
  • www-cluster3.company.com resolves to the IP addresses of all the load balancer instances
  • if a load balancer instance is restarted, the www-cluster3.company.com entry is dynamically updated
  • if the site is moved to a different cluster, the CNAME needs to be updated

Wow, it’s amazing all this is actually possible and not just a dream! Amazon EC2 enables it and RightScale makes it possible to manage without an army of sysadmins running around and tweaking servers all the time.


Archived Comments

Dan Croak
Do you have one of your pretty graphics to go along with this setup? For all us visual learners out there… :)

Sushi
A script/tutorial/template like “rails-on-ec2-standard” post will be of great help.

Thorsten
Dan – good suggestion, I’ll pull out my brushes and paints…

Sushi – I fear this is all still a little too new to condense into a tutorial. It’s one thing to set everything up for one site, but another story altogether to figure out a solution that covers many different sites. What is needed is a design pattern that can be used in many slightly different situations and so it needs to be adaptable while also keeping the core that makes it tick constant. We’re not quite there yet.

RightImages changelog

We are currently maintaining CentOS and Fedora Core RightImages. The motivation for these images and the way we build them automatically using a script is described in our RightScripts rationale blog entry. The short version is that we keep the images up-to-date with distribution releases, we add all the basics needed for operation under EC2, and we add scripts designed to make them work hand-in-hand with the RightScale features. As a community service, we are making the image creation scripts available for free so you can build your own images if you want to tweak what we provide.

We have received a number of requests for a similar Debian/Ubuntu image. While we’re excited about something like this and are keeping it on our to-do list, the truth is that (a) unless we have a customer we probably won’t get to it, and (b) we are focusing all the new RightScale goodies on RedHat based distros, not that things wouldn’t work under Debian but it’s more work to test it all. In any case, we’re open but also very busy…

October 23rd RightImage CentOS5 X64_64 V1_12

  • CentOS5 X86 64-bit image for Amazon EC2 large instance types (“large” and “x-large”)
  • AMI-ID: ami-31c72258
  • Script
  • Changes:
    • Same as 32-bit V1_10 but for 64-bit architecture

September 26th RightImage CentOS5V1.10

  • CentOS5V1.10 RightImage
  • AMI-ID: ami-08f41161
  • Script
  • Changes:
    • CentOS5(core,updates) mirror on S3
    • RubyForge Mirror on S3
    • New Software
      • sysstat
      • rpmbuild
      • fping
      • rrdtool
      • bwm-ng

      Image size reduced from 880MB to 500MB.

July 10th RightImage CentOS5V1.6

  • CentOS5V1.6 RightImage
  • AMI-ID: ami-9a9e7bf3
  • Script
    • Added development tools gcc,gcc-c++, automake, bison, flex, cvs, subversion (probably the most requested change)
    • Added SSH keep-alives (large ClientAliveCountMax setting to accommodate clients that do not respond to keep-alives)
      • ClientAliveInterval 60
      • ClientAliveCountMax 240
  • Gems – net-ssh, xml-simple, net-sftp installed by default.
  • S3Sync updated to version 1.1.4
  • Cron Job to check for AMI-Tools update.
  • Ruby no longer errors about /home/ec2 being world writable.
  • getsshkey service added to fetch ssh key from EC2 early in boot.
  • RightScale service moved to position 81, after ssh and postfix.
  • Yum default repository for Base, and Updates now internal to improve boot times.
  • Many new rightscale service features. Please see upcoming blog entries for new features.
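As a quick check on the SSH keep-alive settings listed above: sshd sends a keep-alive probe after every ClientAliveInterval seconds of silence and drops the connection once ClientAliveCountMax probes in a row go unanswered. The small calculation below (the function name is ours, not part of any tool) shows how long a completely unresponsive client is tolerated with the values from the changelog.

```python
def unresponsive_timeout_seconds(client_alive_interval: int, client_alive_count_max: int) -> int:
    """Approximate time before sshd disconnects a client that answers no probes."""
    return client_alive_interval * client_alive_count_max


if __name__ == "__main__":
    seconds = unresponsive_timeout_seconds(60, 240)
    print(f"{seconds} seconds, i.e. {seconds / 3600:.0f} hours")
```

So the image keeps a silent session open for about four hours, which is what makes the setting tolerant of clients that never reply to keep-alives.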

April 27th, 2007 RightImage FC6V2 and CentOS5V1

  • FC6V2 RightImage: AMI-ID ami-2c8f6a45 and script
  • CentOS5V1 RightImage: AMI-ID ami-268f6a4f and script

Setting up site on EC2 with RightScale

The key to a successful site setup on Amazon EC2 is scalability and redundancy. RightScale makes this easy by providing server templates and multi-server deployments. To get started, let’s take the simplest case: a single server set-up. We have a free “Rails all-in-one” server template that is excellent not just to play around with, but also to use as development server, staging server, or even as production server for small sites that don’t need more horsepower or much redundancy.

Single Server Site

Our Rails all-in-one is described in more detail elsewhere, but you see on the right what’s involved: it runs Apache as a reverse proxy in front of 4 Mongrel/Rails processes all backed by a simple MySQL installation. Last, but not least, we set-up cron jobs that run a mysqldump every 10 minutes to Amazon S3 so you have your data safe in case the instance dies unexpectedly. Apache in the front can be set-up to serve up static and cached pages, it can do HTTP and/or HTTPS, it can canonicalize the hostname (e.g. redirect http://mysite.com to http://www.mysite.com), and it can serve-up a maintenance page while you’re updating your app. Oh, of course Apache load balances across the 4 Mongrels too!

Redundant site

Ready for more? You’re almost ready to launch for real, you expect some traffic soon and don’t want to be reliant on a single server anymore. Time to upgrade to a fully redundant site architecture using 4 servers!

Site - Redundant Setup
The set-up almost all our customers use consists of two front-end servers and two back-end database servers giving us full redundancy. We use this ourselves for the RightScale site itself! Let’s walk through the set-up from beginning to end.

It all starts when a user types http://www.mysite.com into the browser. The browser does a DNS lookup and gets two IP addresses which are the public IPs of the two front-end instances. The browser picks one and tries to connect. If it fails, it rather quickly tries the other, this gives you the fault tolerance you need in case one of the instances dies or has other problems. Also, having multiple IP addresses for your site is the only form of fail-over that browsers support, see this page for additional details.
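The fail-over behaviour of a client that receives several IP addresses can be sketched as a simple loop: try each address in turn and use the first one that accepts the connection. This is an illustration of the idea, not how any particular browser is implemented; the function name and the injectable `connector` parameter are our own, and the addresses in the example are documentation placeholders.

```python
import socket
from typing import Any, Callable, Iterable, Optional, Tuple


def connect_first_available(
    addresses: Iterable[str],
    port: int,
    timeout: float = 3.0,
    connector: Callable[..., Any] = socket.create_connection,
) -> Tuple[str, Any]:
    """Try each address in order and return (address, connection) for the first success."""
    last_error: Optional[OSError] = None
    for address in addresses:
        try:
            return address, connector((address, port), timeout)
        except OSError as exc:
            # Refused, timed out, unreachable: remember the error and try the next IP.
            last_error = exc
    raise ConnectionError(f"all addresses failed: {last_error}")


if __name__ == "__main__":
    # Placeholder front-end addresses; a real client would get these from DNS.
    try:
        address, conn = connect_first_available(["203.0.113.1", "203.0.113.2"], 80, timeout=1.0)
        print("connected to", address)
        conn.close()
    except ConnectionError as exc:
        print(exc)
```

A short timeout keeps the switch to the second address quick, which is what makes two A records usable as a basic form of redundancy.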

The first thing the request from your browser hits is Apache, which has the same roles as in the all-in-one server: dealing with SSL, canonicalizing the hostname, serving up static files, putting up a maintenance page, and anything else you might want a full-fledged web server for. For requests destined to your application, Apache acts as a reverse proxy and forwards the request to HAproxy on the same machine.

HAproxy is a very nice piece of software that proxies and load balances requests to back-end servers. We use it for HTTP here, but it can also do plain TCP load balancing, for example for mail servers. We chose HAproxy because it has good support for health checks and the ability to redirect requests to alternate servers if a back-end fails mid-way. HAproxy is set up to send a health-check request to each back-end process (Mongrel/Rails in our example) to ensure that it’s running properly. It then only forwards requests to servers that respond. While Apache can do load balancing across multiple back-end servers as well using mod_proxy_balancer, it does not include health checks. What this means is that when a server goes down, Apache has to send live customer requests to it every few seconds to see whether it has come back up. As a result, while any Mongrel process is down on any server your customers are going to be impacted because some of their requests are being sent into a black hole. Not nice…
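The difference health checks make can be shown with a toy balancer. This sketch is not HAproxy's algorithm or configuration, just an illustration of the principle: a separate probe marks back-ends as up or down, and live requests are only ever handed to back-ends that passed the last probe. The class name, the `probe` callable and the back-end labels are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


class HealthCheckedBalancer:
    """Round-robin over back-ends, skipping those that failed the last health check."""

    def __init__(self, backends: Iterable[str], probe: Callable[[str], bool]) -> None:
        self.backends: List[str] = list(backends)
        self.probe = probe
        # Nothing is considered healthy until it has passed a check.
        self.healthy: Dict[str, bool] = {b: False for b in self.backends}
        self._next = 0

    def run_health_checks(self) -> None:
        """Probe every back-end with a synthetic request, never a customer request."""
        for backend in self.backends:
            try:
                self.healthy[backend] = bool(self.probe(backend))
            except Exception:
                self.healthy[backend] = False

    def pick(self) -> str:
        """Return the next healthy back-end in round-robin order."""
        count = len(self.backends)
        for offset in range(count):
            candidate = self.backends[(self._next + offset) % count]
            if self.healthy[candidate]:
                self._next = (self._next + offset + 1) % count
                return candidate
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    down = {"server-a:8001"}  # pretend one Mongrel process has died
    balancer = HealthCheckedBalancer(
        ["server-a:8000", "server-a:8001", "server-b:8000"],
        probe=lambda backend: backend not in down,
    )
    balancer.run_health_checks()
    print([balancer.pick() for _ in range(4)])
```

In a real set-up the checks run periodically in the background, so a failed process drops out of rotation within one check interval and returns once it answers again.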

HAproxy forwards the request to one of the Mongrel/Rails processes on either of the two servers. Load balancing across both servers is nice because it means that you can shut the Mongrels on one server down to update the code without impacting customers at all.

Everything on the front-end servers is open source software except for your application. So we need a way for you to get your app code onto the instance at boot time, and a way for you to update the code. Note that for major upgrades we always recommend launching fresh instances so you can keep the old ones around for a day, just in case you want to switch back. (Hey, that’s really cheap insurance at only $2.40 per day per server!) We provide two different RightScripts to do minor code updates: one pulls the code from a tarball located on S3, the other does an svn export from your subversion repository. We recommend the S3 route for production use because otherwise starting new servers depends on the availability of your svn repository, and often the svn export is the slowest portion of the entire instance boot process. But sometimes the svn route is just so much more convenient, especially if you’re playing with a test set-up where you change the code frequently. In addition, for Rails, we set up the app code directory structure the same way capistrano does, so you can point your capistrano config file at your instance and do a “cap update”. Again, something we don’t recommend for production servers but really handy for test and dev boxes.
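For readers unfamiliar with the Capistrano-style layout mentioned above, here is a rough sketch of the idea: every code update goes into its own directory under `releases/`, and a `current` symlink is switched to the new release in a single step. This is our own illustration of the conventional layout, not the actual RightScript; the function name and the file contents are invented.

```python
import os
from typing import Dict


def deploy_release(app_root: str, release_name: str, files: Dict[str, str]) -> str:
    """Write a new release directory and point the 'current' symlink at it."""
    release_dir = os.path.join(app_root, "releases", release_name)
    os.makedirs(release_dir)
    for file_name, content in files.items():
        with open(os.path.join(release_dir, file_name), "w") as handle:
            handle.write(content)

    current = os.path.join(app_root, "current")
    temp_link = current + ".tmp"
    if os.path.lexists(temp_link):
        os.remove(temp_link)
    # Create the new link under a temporary name, then rename it over the old
    # one, so 'current' always points at a complete release (POSIX semantics).
    os.symlink(release_dir, temp_link)
    os.replace(temp_link, current)
    return release_dir


if __name__ == "__main__":
    import tempfile

    root = tempfile.mkdtemp()
    deploy_release(root, "20070101000000", {"REVISION": "1"})
    deploy_release(root, "20070102000000", {"REVISION": "2"})
    print(os.path.realpath(os.path.join(root, "current")))
```

Because older releases stay on disk, switching back is just a matter of pointing `current` at the previous directory and restarting the app servers.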

Behind the front-end servers we place two replicated MySQL instances managed through our Manager for MySQL with backups to Amazon S3. We use frequent backups from the slave server where the load of the backup itself doesn’t affect production and daily backups from the master as added security.

Scalable redundant site

For a fully redundant and scalable site we recommend an architecture that is a natural extension from the 4-server set-up using more of the same components. We basically add a number of Mongrel/Rails application servers and hook them into the load balancing rotation on the two front-end servers. This array of app servers can now be expanded and contracted as warranted by the load on the web site: expand to handle surges in traffic when your PR and marketing lands a success, contract at night when the load on your site goes down and you’d rather hold on to your $$. The wonderful thing is that with this set-up you are paying for the average cost of your hosting needs, not for a once-a-month peak!

Scalable Setup

If you look closely we’re running the app server on the two front-end load balancing instances. We find that the load balancing takes very few resources and that there’s room for some application cycles. Using HAproxy it’s easy to have less traffic go to the local app servers than to remote dedicated instances. The reason we keep the app on the front-end instances (as opposed to switching to pure load balancing instances) is that this way there are always two app servers available even if the array is scaled back to zero servers. Or put differently, when your site is under minimal load at 4am it scales down to 4 instances as opposed to 6. If the load balancing or serving of static files becomes a significant load, it is of course possible to switch off the app serving on the front-end or, alternatively, to add 2 additional front-end load balancing instances.

The way we currently handle changes to the load balancer config when servers come online is to automatically edit the config file using operational RightScripts and do a seamless restart of HAproxy, which ensures that no connections are dropped in the change.
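As a rough illustration of what "automatically edit the config file" might involve, the sketch below regenerates the back-end `server` lines of an HAproxy config from the current list of app servers. It is an assumption-laden example, not the actual RightScript: the function name, the port 8000, the addresses and the weight values are invented. It uses HAproxy's `check` and `weight` server options, and gives the app servers that share a box with the load balancer a lower weight so that they receive less traffic than the dedicated instances.

```python
from typing import Iterable, List


def render_server_lines(
    app_hosts: Iterable[str],
    local_hosts: Iterable[str],
    port: int = 8000,
    local_weight: int = 50,
    remote_weight: int = 100,
) -> List[str]:
    """Build one HAproxy 'server' line per app server, with health checks enabled."""
    local = set(local_hosts)
    lines = []
    for index, host in enumerate(app_hosts, start=1):
        # App servers co-located with the load balancers get a smaller share.
        weight = local_weight if host in local else remote_weight
        lines.append(f"    server app{index} {host}:{port} check weight {weight}")
    return lines


if __name__ == "__main__":
    hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # made-up private addresses
    print("\n".join(render_server_lines(hosts, local_hosts=["10.0.0.1", "10.0.0.2"])))
```

Regenerating the whole block from the list of running servers, instead of patching individual lines, keeps the file consistent when several servers join or leave at once.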

If you are interested in using our site setups please don’t hesitate to try out the free Rails all-in-one server template and please contact us for more at sales@rightscale.com. The multi-server set-ups are not available in pre-packaged form with the free RightScale accounts.


Archived Comments

kai
say a site requires HTTPS. looking at the pic, is everything between Apache and Haproxy, Haproxy and Mongrel, and Mongrel and MySQL not encrypted?

What’s your solution for encrypting all the sensitive data? Security can only be as good as how you deal with the weakest link.

Thorsten
Kai, you are absolutely correct. At the same time, security is a relative concept, there is no absolute security. The back-end traffic is “secured” by Amazon’s network configuration and security groups firewall system. For many sites it is acceptable to have the back-end be unencrypted because the threat is crossing the internet and especially wifi or similar networks at the client end.

We would prefer to have a way to re-encrypt the back-end communication, but at the moment this is not so easy to do given the software load balancers out there (if you have a suggestion, we’d love to hear it). I wish we could drop a netscaler load balancing box into EC2, but I don’t see that happening!

The interim solution we’d use if a customer asked for it is using HAproxy in a TCP balancing mode where it connects TCP streams through to the back-end server. This would connect the SSL connections all the way through to the app server. We’d then have to put the mysql connections through encrypted tunnels to secure that part as well. All this is perfectly doable but it’s getting awfully close to requirements where the outsourced nature of EC2 may not fit the bill no matter how many encryption layers you use.

The 10-minute EC2 server

The new Rails All-In-One server template we just made public makes it really easy to get your own Rails application running on Amazon EC2. And it’s all free to boot!

The server template is a collection of RightScripts that install Apache, mod_proxy_balancer, Mongrel, and MySQL, and a backup cron script all on one EC2 instance. All you need to do is to specify where your code is located and launch the whole thing! This all-in-one server template is excellent for a number of purposes:

  • kick the tires of Amazon EC2 — see your own app running and play around
  • launch a simple site – many small sites don’t need more, the traffic load isn’t high, and if the site is down a couple of hours every few months because of some problem with the instance then that’s not the end of the world
  • do some development — if you need an extra dev server then this is yours cheap, whether for a few hours or for days
  • try something out– want to turn your app upside-down and see how it holds up, don’t mess up your own server or laptop, instead launch-wreck-discard an EC2 instance as many times as you please

Make no mistake: this server template is neither a black box nor a toy! All the configurations are available for you to inspect and modify. You can clone the template and replace the pieces you want to design differently. You can also add additional functionality or even split the server in two and run the database on a separate instance for better performance. This is not a canned set-up like most hosting shops provide: instead, it’s a starting point that you can customize to your needs and wants. If you’re set to grow and are looking into multiple servers, check out the site architectures we recommend.

We’re readying a short how-to that takes you through launching your own Rails app using the server template step-by-step. In the meantime, you can also easily launch a demo server with Mephisto: it’s the same template but we have adapted it to get Mephisto onto it.

OK, so how do you get started? Here are the steps:

Log into RightScale

Log into RightScale, or, if you don’t have a RightScale account yet:

  • sign-up for a free account.
  • Go to your email inbox, look for the validation email, click on the link to get a couple of hours of EC2 time
  • Alternatively, if you have an EC2 account enter your credentials into RightScale (Settings > My Account > Credentials) and thereby get more features enabled
  • Create an SSH key, if you don’t have one: Design > Ec2 > SSH key

Launch a server running Mephisto for you

Mephisto is a blog engine written in Rails and the server you are about to launch has a generic Rails set-up plus the Mephisto app so you can see something in action with the fewest steps.

  1. Swing over to the Server Templates using the Design menu at the left and locate the Mephisto All-In-One v1 demo server template:
    • Before being able to launch it, you need to specify an SSH key (so you can SSH into the server) and a security group (corresponds to ingress firewall settings): click on the edit icon on the right
    • Select an SSH key and a security group, ignore all the other settings, and hit “save” at the bottom (yes, we’re making this easier soon)
    • Now hit the “Launch” button at the top of the page and you see a page with all the settings you can change for this server template. Ignore most of them for now and put something (root@localhost will do) into the ADMIN_EMAIL field, which is empty. Then hit “Launch” again at the bottom.
  2. Watch the instance that will run a demo app appear as “launched” in your Recent Tasks pane on the left: it will take 2-3 minutes to start booting, and then another 6-8 to install and configure all the software.
  3. Sit back, relax and watch the server go through its boot process until it shows “operational” in the Recent Tasks pane. This may take 6-8 minutes. We’re working on reducing the boot time: most of it is actually taken by the gem install commands!
  4. Once your server is operational, you are ready to use your brand new server. Go to Manage > Active Servers and click on the server’s DNS name (ec2-67-000-0-00.z-1.compute-1.amazonaws.com or similar), that should bring you straight to your own Mephisto instance!
    • Quick troubleshooting: if everything looks ready but all connections to your server simply time out, make sure you have ports 22 (SSH) and 80 (HTTP) open in your security group setting: Design > EC2 > Security Groups, add IPs: “tcp 0.0.0.0/0 ports 22..22” and “tcp 0.0.0.0/0 ports 80..80”.
  5. Start using the app. For example, edit the url by appending “/admin” and log in using the default Mephisto user/password (i.e., “admin” and “test”), hit “create new article”, type “Hello World!”, save, go back to the root URL of your server and voila! your first article in Mephisto on your server on EC2. Woot!
  6. Remember that you have just launched servers that are inexpensive, but do cost money. You can check your active instances at Manage > Servers > Active Servers, and terminate any which you’ve finished testing.

So this canned Mephisto all-in-one template is a good example to get your feet wet in bringing up a complete Rails app on EC2. But unless you actually need Mephisto, this doesn’t get you that much. If you are a Rails developer you will want to bring up your own app… so stay tuned for our coming step-by-step guide to launching your own Rails “All-In-One” server using RightScale.


Archived Comments

xxx Praful
very nice to see the setup process in windows but i wanna setup in Linux will any help is available there!

Amazon EC2 changes how MySQL is used

Amazon EC2 will change the way MySQL is used: it suddenly opens a whole slew of new possibilities. What’s really exciting is that it can also simplify the management of MySQL which enables powerful automation as provided in the [RightScale Manager for MySQL](/2007/08/20/redundant-mysql). The factors that enable this are:

  • it takes <10 minutes to fire up a fresh MySQL instance, and there’s a virtually limitless supply
  • there is virtually no cost to keeping old database instances around while setting up fresh ones
  • there is virtually no cost to firing up temporary database instances for special tasks

Let’s take an example: you have 1 master and 2 slaves hanging off it. Your master fails. You determine that slave 1 is at a more advanced position than slave 2, so you promote it to master. How do you proceed to get back to 1 master + 2 slaves? Without some risky magic you can’t roll slave 2 forward to the same position as slave 1 and start replicating from slave 1, so slave 2 is more or less useless. On EC2 you can discard the master and slave 2 and fire up two new slaves that you set up fresh to replicate from the new master. In addition you can keep slave 2 around for a few hours until everything is stable again just in case a problem develops with slave 1 or you discover that you need to roll back a few transactions because something caused a problem.

This is very different from a physical set-up where you would have promoted slave 1 to master, “fixed-up” the original master machine and made it a slave of the new master, and where you would have wiped the data from slave 2 to set it up fresh. Both of these actions sound simple, but in real life they often end up being very stressful. Ideally you don’t want to touch the old master so you can do in-depth troubleshooting, find out what went wrong, and fix the problem. But you also badly need the old master machine to be back in the cluster, making for a difficult choice. Wiping data from slave 2 is also not an easy decision: what if there is a problem with slave 1, or what if you really need to run some of your read-only applications in degraded mode off the now-frozen slave 2 until you have enough machines in the cluster to take the full load?

Again, on EC2 all this becomes easier: you fire up fresh instances and simply keep the old machines around until you’re confident that the new ones are ready to take over and you’ve gotten all the information you wanted off the old ones.
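The step "determine which slave is at a more advanced position" can be sketched as a comparison of replication coordinates. The example below assumes each slave reports the master binlog file and position it has executed up to (the values MySQL shows as `Relay_Master_Log_File` and `Exec_Master_Log_Pos` in `SHOW SLAVE STATUS`); the function names and sample values are made up, and this is not the Manager for MySQL's actual logic.

```python
import re
from typing import Dict, Tuple

Position = Tuple[str, int]  # (master binlog file name, executed byte offset)


def position_key(position: Position) -> Tuple[int, int]:
    """Turn ('mysql-bin.000042', 1337) into a sortable (42, 1337) pair."""
    log_file, offset = position
    match = re.search(r"\.(\d+)$", log_file)
    if match is None:
        raise ValueError(f"unexpected binlog file name: {log_file!r}")
    return int(match.group(1)), offset


def most_advanced_slave(positions: Dict[str, Position]) -> str:
    """Return the name of the slave that has applied the most of the master's log."""
    return max(positions, key=lambda name: position_key(positions[name]))


if __name__ == "__main__":
    reported = {
        "slave1": ("mysql-bin.000042", 9000),
        "slave2": ("mysql-bin.000042", 7500),
    }
    print(most_advanced_slave(reported))
```

The binlog file number is compared first because a later file always follows an earlier one, whatever the offsets within each file.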

If you are interested in using our Manager for MySQL, please contact us at sales@rightscale.com. This stuff is not available with the free RightScale accounts.

Redundant MySQL set-up for Amazon EC2

In order to deploy web sites/services onto Amazon EC2 everyone needs the same components, and so we’re building them! One of the most requested and most critical pieces is a good database set-up, and mysql is clearly the highest in demand. Not that a good postgresql or oracle set-up wouldn’t be of interest or wouldn’t be equally possible, just that more people are asking us (and paying us) for mysql…

What we’ve built is a mysql master/slave set-up with backup to Amazon S3. The set-up consists of one mysql master server which is the primary database server used by the application. We assume it runs on its own EC2 instance but it could probably share the instance with the application. We install LVM (the Linux Logical Volume Manager) on the /mnt partition and place the database files there. We use LVM snapshots to back up the database to S3, which means that we get a consistent backup of the database files with only a sub-second hiccup to the database.

MySQL Master and Slave Setup

Well, the snapshots for backup are actually quite a bit more complicated than that. We have to acquire a read-lock on all tables and this could block things if there is a long running query ahead of us. So there’s a timeout and retry loop which needs to balance off locking up the database and getting the backup done.
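The timeout-and-retry balance described above can be sketched as follows. This is a simplified illustration with injected callables, not the real backup script: `try_lock`, `take_snapshot` and `release_lock` are hypothetical stand-ins for acquiring the global read lock, creating the LVM snapshot and releasing the lock, and the default timings are invented.

```python
import time
from typing import Callable


def snapshot_with_retries(
    try_lock: Callable[[float], bool],
    take_snapshot: Callable[[], None],
    release_lock: Callable[[], None],
    lock_timeout: float = 2.0,
    max_attempts: int = 5,
    pause: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Take a snapshot under a read lock, giving up on the lock quickly and retrying.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        # Wait only briefly for the lock so a long-running query ahead of us
        # cannot make the backup stall the whole database.
        if try_lock(lock_timeout):
            try:
                take_snapshot()
            finally:
                # Always release, so writes resume even if the snapshot fails.
                release_lock()
            return attempt
        if attempt < max_attempts:
            sleep(pause)
    raise TimeoutError(f"could not get the read lock in {max_attempts} attempts")
```

A short lock timeout favours the application (queries are never blocked for long), while more attempts favour getting the backup done; tuning the two is the balance the paragraph above refers to.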

Using the snapshot backup we set-up a slave instance which then starts replicating in real-time from the master. This means that all changes to the master are propagated to the slave within milliseconds, so should the master instance fail, there’s an up-to-date backup. On a master failure we promote the slave to master and set-up a fresh slave. Note that in most databases the slave lags extremely little behind the master. The main situation where the slave starts lagging is when there is a lot of write activity going on in the master. Under heavy write load the slave is slower at applying the replication to its copy than the master on the same hardware because the slave uses only a single thread to apply all changes while the master has one thread per client connection, so it can overlap network communication, cpu processing, and disk I/O using multiple threads, which the slave can’t.
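A back-of-the-envelope model shows why a slave that is only a little slower than the master falls behind quickly under sustained load. The sketch below is a simplification with made-up numbers: it assumes a constant write rate on the master and a constant rate at which the slave’s single thread can apply changes.

```python
def slave_backlog(write_rate: float, slave_apply_rate: float, seconds: float) -> float:
    """Changes still waiting on the slave after `seconds` of load, starting from none."""
    return max(0.0, (write_rate - slave_apply_rate) * seconds)


def catch_up_seconds(backlog: float, write_rate: float, slave_apply_rate: float) -> float:
    """Time for the slave to drain `backlog` while new writes keep arriving."""
    if slave_apply_rate <= write_rate:
        # The slave never catches up while the master writes this fast.
        return float("inf")
    return backlog / (slave_apply_rate - write_rate)


# Hypothetical rates: the master takes 1200 writes/s for 10 minutes,
# while the slave's single thread applies 1000 writes/s.
backlog = slave_backlog(write_rate=1200, slave_apply_rate=1000, seconds=600)

# Afterwards the write load drops to 200 writes/s.
recovery = catch_up_seconds(backlog, write_rate=200, slave_apply_rate=1000)
print(backlog, recovery)
```

With these numbers the slave ends the burst 120,000 changes behind and needs 150 seconds of lighter load to catch up.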

Periodic backups are taken off both the slave and master instances. There is very little penalty for acquiring a read lock on the slave and performing the snapshot and subsequent back-up, so it can be done every few minutes without any real impact (unless the slave has trouble keeping up as described above, in which case it’s probably time to move to multiple slaves). We also take infrequent backups on the master, say once a day, in order to guard against any problems introduced by replication.

While the mysql replication is well proven and used by many large sites in heavy production, there are failure scenarios. First of all, the application should use InnoDB tables exclusively because MyISAM tables are not transactional and have a number of scenarios where replication fails. Even with InnoDB tables there are failures possible. For example, it is possible to write non-deterministic queries in SQL and since mysql uses logical replication the slave re-executes the query, and it may end up using a different execution order than the master, resulting in different data in the database. Ouch. One example is a create table with an auto-increment key using a select from an existing table. The insertion order and hence the keys in the new table depend on the order in which the select is executed, and if it’s executed in a different order in the slave from the master you will end up with an unusable slave DB! (Been there, done that, it still hurts.) Thus: do back-up your master every now and then to be able to recover from such problems. (If you’re paranoid, fire-up an instance every few hours, load up a back-up, and run a few consistency checks: it’ll cost you less than a buck a day to ensure the DB backup is good, that’s cheap insurance.)
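The create-table-from-select problem is easy to reproduce in miniature. The sketch below does not talk to mysql at all; it just mimics what happens when the same rows are inserted in a different order into a table whose keys are handed out sequentially. The function name and the sample rows are invented for illustration.

```python
from typing import Dict, List


def create_table_from_select(rows_in_select_order: List[str]) -> Dict[str, int]:
    """Mimic CREATE TABLE ... SELECT with an auto-increment key.

    Keys 1, 2, 3, ... are assigned in the order the select returns the rows.
    """
    return {row: key for key, row in enumerate(rows_in_select_order, start=1)}


source_rows = ["alice", "bob", "carol"]

# The master happened to read the source table in this order...
master = create_table_from_select(source_rows)
# ...while the slave, re-executing the same statement, read it in another order.
slave = create_table_from_select(list(reversed(source_rows)))

# Rows whose key differs between master and slave: (master key, slave key).
diverged = {
    row: (master[row], slave[row])
    for row in source_rows
    if master[row] != slave[row]
}
print(diverged)
```

Both copies contain the same rows, yet any other table that refers to these keys now points at different rows on the slave than on the master. A row count alone will not reveal the problem, which is why a consistency check has to compare actual contents.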

Best of all, the goodies described above are all controlled through the RightScale web interface. Want a new slave? Just press the “set-up slave” button! Want a back-up? Just press “backup” on the master or on the slave. The functions we have now are:

  • launch database instance
  • restore from S3 backup and configure as master
  • configure as slave, using DB transfer from master for initial state
  • promote slave to master
  • backup to S3
  • daily backups to S3 from master
  • 10-minute backups to S3 from slave

We obviously still have a lot of work ahead of us to improve the flexibility of the set-up. One thing to note is that you are in control of what is executed on the database servers, so they are not opaque virtual appliances. If you need to tweak our database install, slave set-up, backup, or other code, it’s all available in scripts that you can modify. (Of course the more you modify the less we can help when things go wrong.) Also, currently all these functions are “automated” in the sense that you make a decision, push the right button, and things happen. We are adding monitoring and we will add triggers that will cause master-slave failovers automatically.

If you are interested in using our mysql master/slave set-up, please contact us at sales@rightscale.com. This stuff is not available with the free RightScale accounts.


Archived Comments

Pete
Any plans on making this available as a feature you can pay for without signing up for one of those expensive packages?

Thorsten
Pete, thanks for the interest. Obviously “expensive” is in the eye of the beholder. After having spent the days putting all this together, I would rate the price as really low. Together with the help you get to put it all into operation, it’s probably too low to make a profit. Maybe a bit down the line when we have it fully automated and cookie cutter we can reduce the price.

When I started using EC2 I realized that it’s really easy to blow any savings away in sysadmin time. You end up launching and installing servers at an insane rate, because you can, and because it brings sooo many side-benefits. You can do things you didn’t even conceive of before. But someone or something has to install and manage all these servers. If it’s “someone” then that’s really expensive really fast. Hence “something” = RightScale. I hope that when you sum it up at the end of the month the sysadmin you didn’t need to hire thanks to RightScale saves you much more money than the check you write us. YMMV of course.

dennis
I think the prices are great for established businesses, but I think that the smaller startups (who are doing the sysadmin stuff on their own time) would love to be able to launch some of the database stuff as part of the free account…

Paul
Nice use of LVM for the snapshots, just be mindful that long running queries can cause the read lock to take ages to be in place and have the database in a consistent state. Pete Zaitsev mentioned in one of his cookbook presentations that using FLUSH TABLE is a way to speed the process up. I am also reviewing the mysql-table-checksum from the MySQLtoolkit as a way to make sure the master and slave remain completely in sync. Interesting set of tools as well.
Have Fun
Paul

Thorsten
Paul, thanks for the pointers. Yes, we understand that long running queries can delay the read lock. And if you’re not careful, your read lock request can block a pile of other queries behind it, so it’s not good to just sit there and wait for the read lock to go through. We use a relatively short timeout on the read lock and try again, and again.
