Cross-Provider Cloud File Backups (Rackspace CloudFiles to Amazon S3)
Having cross-Cloud backups running on a nightly basis is recommended for any business, regardless of size. Although doing backups is fairly trivial, there are a lot of gotchas that come into play when working in the Cloud, especially if you want to move files from one Cloud provider to another. In this post, we'd like to share the processes and code that we use when backing clients up from Rackspace CloudFiles to Amazon S3.
Using MD5 hashes (ETags) and object listings, we've optimized the operations to perform rsync-style incremental backups of only the objects that have changed. Although initial backups will be slow, the incrementals should be very quick, as the scripts will only back up file additions and file deltas.
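In outline, the incremental logic works like this (a minimal Python sketch for illustration -- the actual wrapper is PHP, and the listing dictionaries here are hypothetical stand-ins for the real container/bucket listings):

```python
# Sketch of rsync-style incremental sync keyed on MD5 ETags.
# source and dest map object name -> MD5 ETag, as returned by a
# container/bucket listing (hypothetical shape, for illustration).
def plan_sync(source, dest):
    """Return (to_put, to_skip): names needing upload vs. already current."""
    to_put, to_skip = [], []
    for name, etag in source.items():
        if dest.get(name) == etag:
            to_skip.append(name)   # unchanged: same MD5 on both sides
        else:
            to_put.append(name)    # new object, or contents changed
    return to_put, to_skip

to_put, to_skip = plan_sync(
    {"index.html": "aaa", "images/logo.gif": "bbb"},
    {"index.html": "aaa"},
)
# to_put == ["images/logo.gif"], to_skip == ["index.html"]
```

Because both CloudFiles and S3 report an object's MD5 as its ETag, a simple string comparison is enough to decide skip vs. put -- no bytes need to be downloaded to detect a change.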
Note: Although we're using the Rackspace CloudFiles API to access CloudFiles, another route is to use CloudFuse (which we plan on writing about in a future blog post). CloudFuse allows you to mount a remote CloudFiles container as though it were local. It's a whole lot faster than using the Rackspace API, but it does have a severe 10,000 file limitation (CloudFiles Limitation) that we find too low for most customers.
For the lazy (and trusting), we've put together an entire tarball of everything you'll need here:
That tarball contains only the minimal files you need from The Rackspace Cloud PHP API and the Amazon S3 PHP Class.
If you'd prefer to download and install everything separately, here are the links:
quicloud's Rackspace / Amazon Wrapper Code
Rackspace's Cloudfiles PHP Library
Amazon's Standalone S3 Rest Library
If you use the Rackspace & S3 sources above, please make sure that:
- You unzip those libraries in the proper relative paths for the wrapper ("./includes/rackspace" and "./includes/amazon").
- You MUST change the CloudFiles 'CF_Object' class to expose "etag" as a public (not private) variable.
We installed the scripts at '/etc/cloudfiles-to-s3/', and the wrapper (rackspace-cf-to-amazon-s3-bkup.php) has hardcoded paths to that directory; if you decide to install elsewhere, please update those paths at the top of the wrapper. The quicloud tarballs all have "cloudfiles-to-s3" as the top-level directory, so unzipping them in "/etc" should work just fine.
Now that you have all of the libraries installed on your server, you simply need to configure the wrapper with your CloudFiles and S3 credentials and your container/bucket names.
Populate all of the following variables in the 'rackspace-cf-to-amazon-s3-bkup.php' wrapper file:
$rackUser = 'RACKSPACE_API_USERNAME';
$rackAPIKey = 'RACKSPACE_API_KEY';
$rackContainer = 'RACKSPACE_CONTAINER_NAME';
$amznAPIKey = 'AMAZON_KEY';
$amznAPISecret = 'AMAZON_SECRET_KEY';
$amznBucketName = 'AMAZON_BUCKET_NAME';
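Before the first run, it's worth sanity-checking that none of the placeholders were left in place; here's a quick sketch of that check (a Python illustration we added -- the REQUIRED map mirrors the wrapper's variable names, but the check itself is not part of the wrapper):

```python
# Sanity-check that placeholder values were actually replaced.
# Keys mirror the wrapper's config variables; values are the
# placeholder strings shipped in the wrapper file.
REQUIRED = {
    "rackUser": "RACKSPACE_API_USERNAME",
    "rackAPIKey": "RACKSPACE_API_KEY",
    "rackContainer": "RACKSPACE_CONTAINER_NAME",
    "amznAPIKey": "AMAZON_KEY",
    "amznAPISecret": "AMAZON_SECRET_KEY",
    "amznBucketName": "AMAZON_BUCKET_NAME",
}

def unconfigured(settings):
    """Return the names still set to their placeholder (or left empty)."""
    return [name for name, placeholder in REQUIRED.items()
            if settings.get(name, "") in ("", placeholder)]

# e.g. only the bucket name was filled in -> five names still to set:
print(unconfigured({"amznBucketName": "my_backups"}))
```

Failing fast on an unset credential beats discovering it mid-run, after the Rackspace listing has already been fetched.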
Validate with Test Run
Set those variables, and let 'er rip by running the wrapper in verbose mode:
/etc/cloudfiles-to-s3/rackspace-cf-to-amazon-s3-bkup.php -v
The "-v" (verbose) switch will output extended information, detailing file-by-file handling (whether the file was put to S3 or skipped because it already exists). After you see a few lines of output you can halt the script with a CTRL-C from the terminal.
You should see some output like:
start run at 2010-10-04 08:10:01
Loaded Rackspace Container (5 Objects)
Loaded Amazon Bucket (0 Objects)
Putting '/index.html' up to Amazon bucket 'my_backups'
Putting '/images/logo.gif' up to Amazon bucket 'my_backups'
Putting '/images/myhappypic.gif' up to Amazon bucket 'my_backups'
After a trial run in verbose mode, browse your Amazon S3 Bucket and verify that the files got uploaded. You can run and halt a couple of times to watch the verbose output report which files it's loading and which it's skipping (because they're already in your Amazon bucket).
Once you're satisfied that everything is operating correctly, you can run without halting to put everything up. To remove the "Skipping / Putting" messages for individual files, run the script without the "-v" (verbose) switch.
Before allowing it to fully free run, though, you may want to calculate your bandwidth costs -- Rackspace will charge you for outgoing bandwidth & Amazon will charge you for incoming. Containers with either large files or lots of Objects will take a looooong time to copy over -- you can back-of-the-napkin estimate your initial upload time by allocating 1 hr per 10,000 Objects (assuming your objects are standard Web content -- HTML files & images). We highly recommend purchasing and using Bucket Explorer to browse and manage your Amazon S3 files, and to watch that the initial S3 population is going smoothly.
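That back-of-the-napkin estimate is easy to script (a quick Python sketch; the 1 hr per 10,000 objects figure is the rough rule of thumb above, not a measured transfer rate):

```python
# Rough initial-upload time estimate: ~1 hour per 10,000 objects,
# per the rule of thumb above for typical web content.
def estimated_hours(object_count, objects_per_hour=10_000):
    # Round up: even a partial batch still costs wall-clock time.
    return -(-object_count // objects_per_hour)

print(estimated_hours(25_000))   # a 25,000-object container -> 3 hours
```

Large media files will skew this badly, so treat it as a floor, not a promise.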
Add to Crontab for nightly incremental backups
Once you're happy with your initial backup, you can add incremental nightly backups to your crontab. Add the code below to your root crontab to run your backups at 2:10AM every night, dumping the "last run" status output to the file "/etc/cloudfiles-to-s3/last-run.log":
10 2 * * * /etc/cloudfiles-to-s3/rackspace-cf-to-amazon-s3-bkup.php > /etc/cloudfiles-to-s3/last-run.log
Validate Restoring Abilities
Initial backups are verified, nightly backups are set up... what could possibly be left? Just the most important and probably most overlooked part -- verifying that you really can restore your data from your backups. Make an agreement with your IT Lead that on a (relatively) sane day in your IT department, you'll have a drill whereupon the manager will make local backup copies of a couple of non-critical files on your live server, and then delete them from the live server, mimicking a catastrophic file loss. For bonus points, have your manager start a stopwatch when the files are removed & see how long it takes your team to get the files back from Amazon. You may even want to build some "S3 to CloudFiles" scripts which your team can use to programmatically restore from.
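An "S3 to CloudFiles" restore script is largely the backup logic run in reverse; here's a minimal sketch of its planning step (a Python illustration with hypothetical listing shapes -- in practice you'd wire in the real S3 and CloudFiles library calls to do the transfers):

```python
# Sketch of an "S3 back to CloudFiles" restore plan: given the names
# lost from the live container and a listing of the backup bucket
# (name -> ETag), report what can be restored and what the backup
# never captured.
def plan_restore(lost_names, backup_listing):
    restorable = [n for n in lost_names if n in backup_listing]
    missing = [n for n in lost_names if n not in backup_listing]
    return restorable, missing

restorable, missing = plan_restore(
    ["images/logo.gif", "notes.txt"],
    {"index.html": "aaa", "images/logo.gif": "bbb"},
)
# restorable == ["images/logo.gif"], missing == ["notes.txt"]
```

The "missing" list is exactly what the drill is designed to surface: files created after the last nightly run are unrecoverable, which is a good argument for running the drill the morning after a backup.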
Hope you enjoyed this tutorial and that you find the provided code useful. We'd love to hear your feedback on your experience with cross-provider file backups, and any "backup war stories". What have you seen work? Where have you seen the dragons lurk?