KARL and zero-downtime updates

For the KARL project, the development team is deeply involved in operations.  Along with Six Feet Up, our hosting provider, we are responsible for much of the “SaaS” side: bugs are reported to us, we do the software updates, we help monitor the site, and we stage and test customer instances.

I really enjoy this aspect.  It’s different from my past involvement in open source projects, where working on the software is somewhat de-coupled from living with your mistakes. [wink]  Stated differently, we have a direct interest in stability, performance, quality, and even in things like making KARL easy to monitor with Zenoss.

We periodically update the software.  Which means restarting the app server.  Which usually means, downtime.  Over time, we’ve whittled that down.

First, KARL restarts fast.  Like, two seconds or so.  Thus, the impact is minimal.  Next, we use mod_wsgi, which lets us do “graceful” restarts in Apache: serve all current requests, then restart the processes.  Together, these make updates very fast.

There’s one aspect that’s harder though.  Sometimes our updates require “evolve” scripts to update data.  For example, adding an index, or fixing a value that requires waking up lots of objects.

We used to do this live in a ZEO client, but when the evolve script takes more than a few minutes, it becomes prone to conflict errors with the running site.  Which means, shutting down the main site.  Which, sucks.  (I’ve become quite obsessive about performance and uptime.)

We have some ideas we think can mitigate this:

  • SSD.  Six Feet Up has the solid-state disks installed.  Because of RAID and cabinet issues, the SSDs are going in the spare box in the rack.  We’ll then move the site over.  The hope is that evolve scripts get faster.  If the evolve scripts are bottlenecked elsewhere (e.g. ZEO single-threading), then that’s a different issue.
  • Read-only mode.  Perhaps we could leave the site in read-only mode during the update, with a little banner informing the user.  Preferably we could put the site in read-only mode without a restart.
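The read-only idea could be sketched as a small piece of WSGI middleware.  This is purely hypothetical, not KARL’s actual code: the flag-file path and the banner text are made up.  The point is that touching or removing a file toggles the mode with no restart.

```python
import os


class ReadOnlyMiddleware:
    """Reject write requests while a maintenance flag file exists.

    Sketch only: the flag path and message are invented for
    illustration, not anything KARL actually ships.
    """

    WRITE_METHODS = {"POST", "PUT", "DELETE", "PATCH"}

    def __init__(self, app, flag_path="/tmp/site-read-only"):
        self.app = app
        self.flag_path = flag_path

    def __call__(self, environ, start_response):
        # Reads pass through; writes are refused while the flag exists.
        if (environ.get("REQUEST_METHOD") in self.WRITE_METHODS
                and os.path.exists(self.flag_path)):
            body = b"Site is in read-only mode for maintenance."
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
            return [body]
        return self.app(environ, start_response)
```

An operator would then `touch` the flag file before running the evolve script and delete it afterward; the app server itself never restarts.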

Any other ideas on minimizing downtime on such applications without major changes in architecture?


8 Responses to “KARL and zero-downtime updates”

  1. Duncan Says:

    Surely, provided you don’t commit the changes until the evolve script is finished, the only thing that might get a conflict error is the evolve script itself. You might need to run it more than once before it completes successfully but at least you can be sure you won’t interrupt users.

    Or am I missing something?

    I quite often use self-contained scripts to update our Plone sites and I always assumed that I wasn’t upsetting real users even if I do sometimes have to run them multiple times and/or at quiet times of the day.

    • Paul Everitt Says:

      You’re right that it is the evolve script that pays the price. Some of our evolve scripts, though, touch so many things (particularly catalog index objects) that it is really hard to get them to complete.

We certainly use a lot of ZEO clients as cron jobs in KARL. For example, email-in. (Though we are now converting these to Supervisor-managed daemons.) Those have almost no chance of a conflict error.

      I guess your point is, can we do evolve scripts in a way that is less prone to this problem? Perhaps we make it easy to say, “Run this, if you have a conflict error, just keep trying a hundred more times.”
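That “keep trying” idea might look like the sketch below.  In a real ZEO client you would catch ZODB’s `ConflictError` and abort the transaction before retrying; here a generic exception class and an abort callback stand in, so the shape is mine rather than any actual KARL helper.

```python
def run_with_retries(evolve, max_attempts=100, on_conflict=None,
                     conflict_exc=Exception):
    """Run an evolve callable, retrying whenever a conflict is raised.

    Sketch: in real ZODB code, conflict_exc would be
    ZODB.POSException.ConflictError and on_conflict would call
    transaction.abort() to discard the failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return evolve()
        except conflict_exc:
            if on_conflict is not None:
                on_conflict()  # e.g. transaction.abort()
            if attempt == max_attempts:
                raise  # give up after the final attempt
```

The script stays dumb and idempotent; the wrapper absorbs the conflicts.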

  2. Lee Joramo Says:

Paul, is there a video demonstrating KARL3 in action from the users’ point of view? I have just installed KARL to play with, and it looks like something that a couple of my clients could really use.

  3. David Glick Says:

    Depending on the nature of your “evolve” script, one approach is to operate on items in small batches, committing transactions as you go. For example, Alec Mitchell recently created this ZCatalog rebuild script which reindexes objects in batches, and retries each batch on conflicts: http://svn.plone.org/svn/plone/Products.PloneOrg/trunk/scripts/catalog_rebuild.py

    Of course, this probably makes the script take longer, and it isn’t a valid approach in cases where a partial commit would be harmful.
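The batch-and-retry pattern David describes can be sketched generically.  This is not Alec’s script; the callbacks are placeholders where real code would use `transaction.commit()`, `transaction.abort()`, and `ConflictError`.

```python
def evolve_in_batches(items, evolve_one, commit, abort,
                      batch_size=100, max_retries=5,
                      conflict_exc=Exception):
    """Evolve items in small batches, committing after each batch and
    retrying a batch when it hits a conflict.

    Sketch: in ZODB, commit/abort would be transaction.commit and
    transaction.abort, and conflict_exc would be ConflictError.
    """
    todo = list(items)
    for start in range(0, len(todo), batch_size):
        chunk = todo[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                for item in chunk:
                    evolve_one(item)
                commit()  # a conflict can only cost us one batch
                break
            except conflict_exc:
                abort()  # discard the failed batch and try again
                if attempt == max_retries - 1:
                    raise
```

As David notes, each batch must leave the database consistent on its own, since earlier batches are already committed when a later one fails.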

    • philikon Says:

Yeah, if you don’t keep track of which objects you still need to evolve etc., you’re going to end up with a half-arsed evolution. In the good old Zope 2 days somebody once showed me their solution to this problem. Essentially they had a separate ZCatalog just to keep track of schema evolution. When the time came to upgrade, they knew exactly which objects to evolve (without walking the whole database) and they could do it in batches — ZCatalog would keep track of it all.
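The separate-catalog idea reduces to a small bookkeeping structure.  In this sketch a plain dict stands in for the ZCatalog, and all the names are invented; the useful property is that a re-run after an interruption only touches the objects that weren’t finished.

```python
class EvolutionTracker:
    """Track which objects still need evolving, so an upgrade never
    has to walk the whole database.

    Sketch of the separate-catalog idea: a dict mapping object id to
    the schema version last applied stands in for the ZCatalog.
    """

    def __init__(self, current_version=0):
        self.current_version = current_version
        self.versions = {}

    def register(self, oid, version=0):
        self.versions.setdefault(oid, version)

    def pending(self):
        return [oid for oid, v in sorted(self.versions.items())
                if v < self.current_version]

    def evolve(self, evolve_one, batch_size=100, commit=lambda: None):
        """Evolve pending objects in batches.  Safe to re-run after an
        interruption, because finished objects are skipped."""
        todo = self.pending()
        for start in range(0, len(todo), batch_size):
            for oid in todo[start:start + batch_size]:
                evolve_one(oid)
                self.versions[oid] = self.current_version
            commit()
```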

    • Maurits van Rees Says:

      Another example of intermediate commits is in some of the Poi migrations. See http://dev.plone.org/collective/browser/Products.Poi/branches/1.2/Products/Poi/migration.py

The migrate_responses function searches for issue trackers to update and does a transaction commit after each tracker. For the migration on plone.org this was not enough for some trackers, so I wrote the migrate_responses_alternative function, which has the exact same result but commits after updating each issue of a tracker, taking care to always keep the committed data consistent and not to commit unrelated temporary changes (the setSendNotificationEmails calls).

  4. Nate Aune Says:

    This is quite easy to do with the Amazon AWS infrastructure because you can take live consistent snapshots of an EBS volume where your database file is stored.

    So you could take a snapshot of the live volume, put the live volume into read-only mode, run your migrations on a new volume created from the snapshot, and then when the migration is finished swap out the old volume for the migrated volume. The advantage is that if the migration goes south, you can simply revert back to the old volume, and not skip a beat. This can of course all be automated with Amazon’s API, to minimize downtime and human error.

    This example is for MySQL but the principles would apply also to the ZODB. http://alestic.com/2009/09/ec2-consistent-snapshot
