Clearing up Disaster Recovery and Business Continuity Confusion

As I work with different teams, there is often some confusion about what people mean when they say “DR”. Team discussions will delve into RPO, RTO, Backup, High Availability, and Business Continuity. Unfortunately, these concepts get mixed up and confused with each other at times, and depending on the context, people have different perceptions of what these actually are. Let’s clear things up.

It’s important for organizations and architecture teams to align on nomenclature and a common understanding of these concepts. I stress this because even things that should be straightforward aren’t always understood the same way across the board; for example, business continuity recovery objectives may imply different things depending on one’s understanding or misunderstanding of the concepts.

When these differences in understanding occur, the organization is at a disadvantage due to the churn and friction that come with misalignment. People often misunderstand what business continuity is, and they may misunderstand the relationship between business continuity objectives and DR. Other times, there are gaps in knowledge; I’ve seen architects put together comprehensive solutions to support DR without any understanding of what business continuity is. This creates a lot of risk and rework, but with a little effort to ensure common understanding and alignment, these risks can be removed.

Business Continuity

Before we can understand DR and its purpose, we have to understand business continuity and the business continuity plan (BCP).

Business continuity is the continuation of critical business processes during a major outage or catastrophic event. It requires ensuring that there is a plan to support critical business processes. DR exists to support business continuity, but DR is not business continuity; it’s only a facet of business continuity.

Business continuity has core recovery objectives and SLAs, of which the RTO and RPO are especially important.

  • RTO – Recovery Time Objective – The planned recovery time to restore a critical business function.
  • RPO – Recovery Point Objective – The window of acceptable data loss that supports the business in the event of a major outage or disaster.
  • MTO – Maximum Tolerable Outage – The maximum amount of time that business processes, any supporting systems they require, and the data or information those processes need can remain unavailable before there is an unacceptable business impact.
  • MTDL – Maximum Tolerable Data Loss – The maximum amount of data loss, expressed as a window of time, that the business is able to accept.
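
To make the relationship between these four values concrete, here is a minimal Python sketch. The `RecoveryObjectives` class, its field names, and the payroll figures are all hypothetical. It also assumes, as is common practice but not stated above, that an objective (RTO, RPO) should be set no longer than the corresponding tolerable maximum (MTO, MTDL).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Recovery objectives for one critical business process (illustrative)."""
    process: str
    rto: timedelta   # planned time to restore the business process
    rpo: timedelta   # acceptable window of data loss
    mto: timedelta   # maximum tolerable outage
    mtdl: timedelta  # maximum tolerable data loss, as a window of time

    def inconsistencies(self) -> list[str]:
        """Return a list of problems; an empty list means the values agree."""
        problems = []
        # Assumption: a planned objective should not exceed the tolerable maximum.
        if self.rto > self.mto:
            problems.append("RTO exceeds MTO")
        if self.rpo > self.mtdl:
            problems.append("RPO exceeds MTDL")
        return problems


# Made-up example values for a hypothetical payroll process.
payroll = RecoveryObjectives(
    process="payroll",
    rto=timedelta(hours=8),
    rpo=timedelta(hours=4),
    mto=timedelta(hours=24),
    mtdl=timedelta(hours=2),
)
print(payroll.inconsistencies())  # ['RPO exceeds MTDL']
```

A check like this is only a conversation starter: it flags objectives that contradict each other so the business continuity team can revisit them.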

You may see organizations use different terminology for the above terms; for example, instead of MTO, they may use MAO (Maximum Allowable Outage) or MTPD (Maximum Tolerable Period of Disruption).

RTO is the target amount of time to recover critical processes. This does not mean recovering systems, or recovering all systems. The SLA to bring up the supporting systems may be different from, and longer than, the service objective to bring up the critical business processes. The business continuity plan (BCP) would define where manual or alternate processes can be used while underlying systems are unavailable.

From a BCP perspective, having a short RTO for critical processes while expecting systems to be available later is advantageous from a cost perspective and allows business teams to keep the lights on regardless of the current state of underlying systems. However, manual processes to keep the lights on during this outage will require data reconciliation later, which has its own set of costs and complications. Business continuity planning generally strikes a balance: it ensures that the business can continue to operate during a disaster scenario, preferably with as many of the systems it needs available as soon as possible.

Although systems can be designed to be fully recoverable very quickly, that typically comes at the price of significantly higher operational costs. Those costs can be prohibitive for many organizations, given the low risk of disaster and the ability to keep the lights on with alternate processes and fewer systems, as per a well-formed business continuity plan.

Where does DR fit in?

A DR plan is part of the organization’s business continuity plan and exists to support business continuity during a disaster or large outage scenario. DR stands for Disaster Recovery and literally refers to the execution of the DR plan; it is also sometimes used as an umbrella term encompassing everything required to make the DR execution happen. A DR design diagram isn’t in itself DR either, although architecture and design may be required to support the DR plan. A DR plan has defined processes, procedures, and role definitions used to bring systems back online and restore data within a required SLA.

Inevitably, as an architect, you’re going to be asked about DR. These questions will be familiar to you: “Does your solution support DR?”, “Where is DR?”, “Where is your DR diagram?”, etc. These are valid concerns and valid questions, but they can sometimes sow confusion as well.

When people ask an architect for “DR”, what they are trying to understand is how the solution accounts for the technology, software, and processes required to execute the DR plan. They want to know that you’ve done the due diligence to be satisfied that DR will be supported with the design.

DR aims to restore system functionality to support the critical processes identified as part of BCP. System restoration may be staggered and done in phases depending on the priorities outlined in the DR plan, which is informed by the BCP.

A “DR design” or “DR diagram” supports the DR plan and the vision to support the technology required to execute DR, but a DR diagram is not DR nor a DR plan. DR is the actual execution of a DR plan; organizations should be clear on this terminology.

A DR diagram will outline how the solution will support the DR plan, including where backup and recovery, rebuild, active-active, geo-replication, and synchronization between data centres will be employed in support of DR.

Understanding DR vs BCP objectives

Restoring business processes is not the same as restoring systems, and the nuance here is that systems and applications can have different recovery objectives than the business processes that they support.

RTO is a business continuity objective. Because BCPs can outline alternate business processes in the case of disaster, it would not be true to say that the schedule to bring up an underlying system that supports a business function is the same as the RTO for that business function as outlined in the BCP. This is something that architecture and technology teams sometimes get wrong, and getting it wrong could mean overbuilding a solution, which costs significantly more both initially and in ongoing operations.

However, if critical business processes are dependent on underlying systems, that dependency is part of the BCP and would then inform the DR plan. This could mean that it is imperative to bring up specific systems within the BCP-defined RTO so that critical business functions can resume, because those functions depend directly on the systems.
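
The distinction between a process RTO and a system recovery target can be sketched in a few lines of Python. Everything here is hypothetical: the `ProcessDependency` class, the `alternate_covers_for` field (how long a manual or alternate process from the BCP can stand in for the system), and the example processes and durations. It is a simplified model, not a method prescribed by any BCP standard.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ProcessDependency:
    process: str
    system: str
    process_rto: timedelta
    # How long the BCP's manual or alternate process can cover for the
    # system; None means the process cannot run without the system.
    alternate_covers_for: Optional[timedelta] = None


def system_recovery_target(dep: ProcessDependency) -> timedelta:
    """Latest time by which the system must be back for this one dependency."""
    if dep.alternate_covers_for is None:
        # Direct dependency: the system must be up within the process RTO.
        return dep.process_rto
    # The alternate process keeps the function running, so the system can
    # come back later, but the target is never earlier than the process RTO.
    return max(dep.process_rto, dep.alternate_covers_for)


def targets_by_system(deps: list[ProcessDependency]) -> dict[str, timedelta]:
    """A system serving several processes inherits the strictest target."""
    targets: dict[str, timedelta] = {}
    for dep in deps:
        target = system_recovery_target(dep)
        targets[dep.system] = min(target, targets.get(dep.system, target))
    return targets


# Made-up dependencies for illustration.
deps = [
    ProcessDependency("take customer orders", "order-entry",
                      timedelta(hours=4), timedelta(hours=48)),
    ProcessDependency("authorize payments", "payment-gateway",
                      timedelta(hours=2)),
    ProcessDependency("ship orders", "order-entry",
                      timedelta(hours=8), timedelta(hours=24)),
]
targets = targets_by_system(deps)
for system, target in targets.items():
    print(system, target)
```

In this made-up example the payment gateway must be restored within the 2-hour process RTO, while the order-entry system has 24 hours because paper-based alternates cover both processes that use it.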

Are High Availability and Backup also DR?

High availability and backup are not DR but they can play a role as part of DR planning.

High Availability (HA): A solution that ensures a high uptime during regular systems operation that typically incorporates redundancy, failover, geo-availability, and capacity management.

Backup: Backup strategy, technology, and processes to facilitate backup and restore functions.

It may be convenient to think that HA or Backup will inherently supply the underlying technology for DR capability, but DR planning requires much more due diligence.

When we consider DR we have to be intentional about the technology and how that technology specifically supports the DR plan.

Restoring from a backup can be used as part of a DR plan, and it often is, but other technology options exist that need to be considered, such as geo-replication of data in real time.

Building a highly available system could lend itself to DR as well, but it’s important not to conflate high availability (HA) with DR. High availability is an availability attribute but not a DR or BCP attribute. HA will provide a guarantee of a high percentage of uptime during regular operations, but it does not make the solution inherently DR or BCP ready. HA covers normal operations and not BCP recovery scenarios.

It’s true that if you have active geo-replicated systems, you may be able to continue operating seamlessly from a different data centre after a disaster at a single data centre. However, there are other considerations, such as reduced capacity and availability and impacts to downstream processes. BCPs will also cover scenarios where the entire system goes down across data centres. For example, a software bug deployed to all geo-replicated data centres could bring all the data centres down at the same time, so it’s not good enough to say we have DR covered because we are geo-replicated.

Therefore, additional consideration is required in terms of how DR will be supported in relation to BCP. Also, remember that the DR plan must directly support the BCP; the goal of DR isn’t to restore 100 percent capacity, nor is it meant to restore all systems that are running during normal business operations. As DR is always informed by business continuity, teams developing DR plans and DR designs work with business continuity teams and are guided by the BCP’s prioritization of critical business processes mapped to systems.

There will be completely different approaches to DR given different RPOs and RTOs. Achieving a near-zero RPO or RTO requires much more careful planning and a much costlier implementation. Conversely, an RPO and RTO of 24 hours affords an organization much more leeway for DR planning. In 24 hours, data can be recovered from backups, and new servers and software can be manually provisioned in data centres. Ultimately, these metrics are decided by the business, driven by business continuity planning teams, and aligned with executives; they will vary across organizations given the business impact, resource/people capabilities, technology capabilities, and mitigation options available.
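
As a rough illustration of why the objectives drive the approach, the sketch below checks a periodic backup-and-restore strategy against a given RPO and RTO. The function name and all numbers are hypothetical. It assumes the worst-case data loss is one full backup interval (a disaster just before the next backup runs) and that `restore_duration` already includes provisioning time; it ignores the time needed to detect and declare a disaster.

```python
from datetime import timedelta


def backup_restore_meets_objectives(
    backup_interval: timedelta,
    restore_duration: timedelta,
    rpo: timedelta,
    rto: timedelta,
) -> dict[str, bool]:
    """Check a periodic backup-and-restore strategy against an RPO and RTO."""
    # Assumption: a disaster just before the next backup loses roughly
    # one full interval of data.
    worst_case_data_loss = backup_interval
    return {
        "rpo_met": worst_case_data_loss <= rpo,
        "rto_met": restore_duration <= rto,
    }


# Nightly backups and a 10-hour rebuild-and-restore (made-up figures).
nightly = dict(backup_interval=timedelta(hours=24),
               restore_duration=timedelta(hours=10))

relaxed = backup_restore_meets_objectives(
    **nightly, rpo=timedelta(hours=24), rto=timedelta(hours=24))
strict = backup_restore_meets_objectives(
    **nightly, rpo=timedelta(minutes=5), rto=timedelta(minutes=15))

print(relaxed)  # {'rpo_met': True, 'rto_met': True}
print(strict)   # {'rpo_met': False, 'rto_met': False}
```

With 24-hour objectives, nightly backups are enough in this simplified model; with near-zero objectives the same strategy fails both checks, which is where options such as real-time geo-replication come into consideration.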
