
Disaster Recovery (DR) Options in Cloud Environments

At the OpenStack Summit in Austin (2016), there were several presentations on Disaster Recovery; the most interesting was “Application Level Consistency Snapshot and Disaster Recovery.” Having spent time in the DR realm of a Fortune 100 financial services company, I am quite passionate about Business Resumption and Business Continuity. With the release of OpenStack Newton in October 2016, it appears application consistency may get baked in. Application Level Consistency Snapshots are exciting news for those of us who are passionate about Disaster Recovery, and I’ll be keeping an eye on this over the coming months.

As an organization looks to deploy OpenStack, integrators, providers, and operators may find it beneficial to place workloads into various cloud “buckets.” These buckets make workload placement much easier when working towards an OpenStack-based cloud. These basic, very broad buckets are:

Non-cloud Friendly

Workloads that require everything to be done manually, including the request for the server (make, model, CPU count, RAM, storage), storage (SAN/NAS/NFS/none), and the network (with a supporting cabling diagram), with supporting processes that are well suited to extended deployment times. This process might also involve Excel spreadsheets, paper request forms, and maybe an old sneaker-net chat applet.

Cloud Friendly

These are workloads that use distributed services and are virtualized. They keep their application code in a source code manager and use configuration management and automation engines (Ansible, Puppet, Chef, etc.). The support processes are designed for agility and consistency, rather than mirroring the legacy processes used to support bare-metal applications with very long deployment times.

Cloud Native

Workloads built with OpenStack or another cloud at their core. These workloads are built from the ground up using a mature development and deployment pipeline (e.g., a CI/CD pipeline).

Note: “OpenStack does not equal CI/CD.”
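To make “mature deployment pipeline” slightly more concrete, here is a minimal sketch of what a single automated deployment step might look like against an OpenStack cloud, using the openstacksdk Python library. The cloud name (“mycloud”), image, flavor, and network names are hypothetical placeholders; a real pipeline would wrap a step like this in build, test, promotion, and rollback stages.

```python
# Minimal sketch of a single pipeline deployment step using openstacksdk.
# Assumes a cloud named "mycloud" is defined in clouds.yaml; the image,
# flavor, and network names are placeholders, not real resources.
import openstack

def deploy(build_id, cloud="mycloud"):
    conn = openstack.connect(cloud=cloud)

    image = conn.compute.find_image("app-image-" + build_id)   # image baked by an earlier CI stage
    flavor = conn.compute.find_flavor("m1.small")
    network = conn.network.find_network("app-net")

    server = conn.compute.create_server(
        name="app-" + build_id,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    # Block until the instance is ACTIVE so the pipeline can run its smoke tests next.
    return conn.compute.wait_for_server(server)

if __name__ == "__main__":
    deploy("42")
```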

Each workload has several moving parts, in very broad terms:

  • processing (CPU, CUDA, etc.)
  • memory (RAM)
  • storage (NAS, NFS, SAN)
  • network
  • application
  • support teams (storage team, server team, DB team, app team, etc.)
  • business unit

 

For the recovery of workloads, cloud-native workloads simply re-deploy, assuming a mature deployment pipeline, of course. Many DIY OpenStack deployments are designed for cloud-native workloads yet become the landing place for “non-cloud friendly” workloads. A legacy workload might be an application coming from a 100% virtual environment whose deployment method uses the same process as an application running on a physical bare-metal server. When this occurs, these DIY OpenStack operators are quickly overwhelmed by outdated, non-cloud-friendly processes and support mechanisms.

When it comes to recovery, in times past there were three dominant methods:

Recovery-by-Rebuild

Back up the data, rebuild the server, re-install the application, re-hydrate the data, cross fingers. This method is often used in conjunction with a recovery runbook: finding the backups (be they tapes, virtual tapes, etc.) and following a runbook describing how to install the application (often written when it was first installed), the OS, what the network configuration might look like, and so on. In my experience, this has yielded a 25% success rate with an RTO of “best effort.”

Recovery-by-Re-image

Back up the system image, re-image the server, re-hydrate the data, cross fingers. This method is often used in conjunction with a recovery runbook, finding the images, etc. It is more predictable than the Recovery-by-Rebuild method. On a per-server basis, re-imaging yields better results, although, as fortune would have it, I’ve not needed to do this on a large scale.

Recovery-by-Restore

Create a snapshot of the system’s state and data, restore the system to a working restore point, cross fingers. This method yields the best results, especially when it is supported by a GUI of sorts that allows almost anyone to execute a restore with a variable level of training (e.g. VMware Site Recovery Manager, NetBackup, TSM for Virtual Environments, EMC Avamar).
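In an OpenStack context, a rough analogue of this method can be sketched with the openstacksdk Python library: snapshot an instance to an image, then boot a replacement instance from that restore point. This is only a minimal sketch under assumed names (“mycloud,” the server, flavor, and network); it is not the API of any of the products mentioned above, and purpose-built tools add scheduling, cataloging, and application quiescing on top of this basic capability.

```python
# Minimal sketch of snapshot-and-restore against an OpenStack cloud using
# openstacksdk. "mycloud" and all resource names are hypothetical placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")

# 1. Protect: snapshot the running instance to an image (a restore point).
#    Note: this is crash-consistent unless the guest is quiesced first.
server = conn.compute.find_server("billing-app-01")
conn.compute.create_server_image(server, name="billing-app-01-restore-point")

# 2. Recover: once the snapshot image is active, boot a replacement
#    instance from that restore point.
image = conn.compute.find_image("billing-app-01-restore-point")
flavor = conn.compute.find_flavor("m1.small")
network = conn.network.find_network("app-net")

restored = conn.compute.create_server(
    name="billing-app-01-restored",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
conn.compute.wait_for_server(restored)
```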

The evolution of cloud models and processes, be they OpenStack, AWS, GCE, CI/CD, etc., has introduced a fourth method into the recovery space:

Recovery-by-Re-Deploy

This is where recovery is the redeployment of a workload in a very rapid fashion, often fully orchestrated, with minimal to no downtime. This method requires a mature deployment and support pipeline. In the context of OpenStack, this is the objective. Using OpenStack doesn’t magically eliminate an organization’s need for Disaster Recovery or Data Protection.

Note: “OpenStack does not equal Recovery-not-Required.”
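As a minimal sketch of this idea, reusing the hypothetical deploy() step from the pipeline sketch earlier in this post: recovery is simply the same deployment code pointed at a healthy target cloud (here a placeholder “mycloud-dr” entry in clouds.yaml).

```python
# Minimal sketch of Recovery-by-Re-Deploy: recovery is the normal deployment
# run again against a healthy cloud. deploy() is the hypothetical pipeline
# step sketched earlier; "mycloud-dr" is a placeholder clouds.yaml entry.
import openstack

def recover(build_id, target_cloud="mycloud-dr"):
    conn = openstack.connect(cloud=target_cloud)

    # Remove any failed remnant of the workload before redeploying it.
    old = conn.compute.find_server("app-" + build_id)
    if old:
        conn.compute.delete_server(old, ignore_missing=True)
        conn.compute.wait_for_delete(old)

    # Recovery and deployment are the same code path.
    return deploy(build_id, cloud=target_cloud)
```

The point is not the particular library; it is that the recovery path gets exercised every time the application is deployed.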

This fourth method enables an organization to mature how it builds and deploys applications, with the added benefit of rapid re-deployment and quick recovery. At the OpenStack Summit in Austin, there was a keynote describing how there is more to change than tools when integrating OpenStack into one’s portfolio of capabilities. As has been my opinion for nearly a decade, the success of driving change into any organization depends on updating four core areas:

  • Processes – These are the processes that define how a workload comes to life, how it is supported during that life, and how it is decommissioned.
  • Culture – Very much like the process component, this is the institution’s organizational culture. An example might be a culture that depends on external companies (outsourcing) to resolve internal problems, or a company whose culture doesn’t learn new things or support new initiatives attempting to adopt OpenStack. Often these companies think OpenStack will magically allow them to take on a “Cloud Culture.”
  • Mindset – Tied closely to culture, this is the individual’s mindset as shaped by the organizational culture. Often the two are in lock step; however, mindsets must change in order to influence the culture. We’ve all met leaders, peers, subordinates, engineers, and architects who were the architects of the legacy methods. In that mindset, OpenStack must accommodate the existing environment; as a result, unrealistic requirements get pushed into the project and the whole initiative ends up in the red, largely because more time is spent trying to retrofit OpenStack to accommodate workloads that are anything but cloud-friendly or cloud-native.
  • Technology – This is the technology itself (be it software or hardware); arguably, it is the easiest component to change. Unfortunately, this technology shift often leads to unrealistic visions of becoming Google, Netflix, or AWS. Because it is so easy and has minimal barriers, companies sometimes purchase a product that promises “DevOps” or “Cloud,” all the while only realizing additional support costs.

Note: “OpenStack does not equal DevOps.”

Many organizations still use processes that enable technology silos; these are organizations where different teams “own” a piece of something (think network team, directory team, server team, OS team, DB team, etc.). In these sorts of systems, application deployments require so many gates and approvals that this mindset/culture alone hinders rapid deployment.

OpenStack is a product, a product that enables something. What that something is, is up to us. As OpenStack integrators, operators, and contributors, we must look beyond the next shiny object and the next buzzword, and work behind the scenes to update processes, change restrictive cultural thinking and mindsets, and integrate better technology.

OpenStack works fantastically when enabling workloads designed to run on OpenStack, supported by upgraded and mature processes. These are applications where application teams, developers, and operations work together using a cloud-friendly process. If more organizations adopted this way of thinking, OpenStack would cease to be the risky adventure many make it out to be and instead become something that enables a real shift in the modern data center and in how we all consume technology.

Author: Lindis Webb, Cloud Architect, Solinea