Understanding Disaster Recovery Responsibilities When Using the Cloud

In the wake of recent Cloud Service Provider (CSP) outages, what is your organization responsible for when it comes to complex IT architecture?

Many organizations today rely on complex IT infrastructure to support their operations, leveraging solutions ranging from internal hosting to cloud hosting to dependence on third-party systems. IT service delivery is getting more intricate, in large part due to the need to leverage different IT tools and services from a variety of providers. Cloud-based solutions, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS), promise simplicity for the end user. However, IT service delivery and management usually become much more difficult due to the complexities around architecture and integrations. As a result, IT disaster recovery planning also becomes more difficult, as it must account for these complexities and coordinate with various third parties to ensure adequate coverage. The bottom line: simply defining who is responsible for what in disaster recovery planning can be difficult.

Information Technology Disaster Recovery (ITDR) managers are tasked with orchestrating and managing ITDR across the entire landscape of hosted solutions. At first, this may not seem too daunting, as it’s easy to think of SaaS and other cloud-hosted systems as someone else’s responsibility. However, over the past year, we’ve seen the world’s best cloud service providers experience downtime. The Amazon S3 service disruption on February 28, 2017, made nationwide news, even though the total downtime was less than six hours. Last October, dozens of popular, frequently used websites were unavailable after hackers unleashed a DDoS attack on the servers of a major DNS host. The most recent and widespread ransomware attack forced many companies to rely on (or establish on the fly) workaround procedures for critical systems. Hundreds of organizations were impacted in some way by these outages.

Depending on the level of integration and the degree of interdependency between operations and the cloud-hosted systems, the impact of a cloud service outage can escalate quickly. Consider the broad spectrum of potential impacts that can result from a cloud service disruption:

  • Lost sales
  • Revenue loss or cash flow issues
  • Operational delays or backlogs
  • Regulatory non-compliance
  • Reputational damage
  • Life safety risks
  • Other customer-facing impacts

So, the question becomes:

How does an organization leverage highly capable, secure, and convenient cloud solutions while ensuring that it is protected from the potential risks and resulting impacts that accompany unpredictable CSP downtime?

IT and ITDR managers are increasingly responsible for understanding the IT landscape holistically and breaking down the traditional silos between various solutions to create a comprehensive ITDR program. ITDR for internally hosted solutions is the responsibility of the organization, and SaaS DR is the responsibility of the vendor (although the organization must validate that ITDR is in place and address integration with other systems). However, responsibility for cloud solutions is far less clear. So, for this article, let’s create clarity around what ITDR responsibilities look like for cloud-hosted solutions, and where business continuity comes into play.

When it comes to ITDR planning, the segregation of responsibilities between an organization and a CSP is, in theory, a matter of contractual agreement, not just a logical or technical assumption. In reality, however, the responsibilities can be rather murky. The following matrix provides an overview of where responsibility traditionally lies when it comes to CSP agreements:

CSP Agreement Matrix

Once you understand where responsibility lies from an ITDR perspective, it’s important to understand that it is impossible to completely mitigate the risks associated with cloud systems and hosting: this is where business continuity planning comes in. Deferring responsibility for a system does not defer the risk, so contingency planning must be in place. How do you actually manage the downtime when it occurs? During the AWS outage, organizations that had systems replicated in two different regions were largely safe from disruption. So, in advance, consider the geographic diversity provided by your chosen hosting solution.
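To make the two-region point concrete, here is a minimal Python sketch of how an application might choose between replicated regions during an outage. It is an illustration under stated assumptions, not a description of any CSP's failover service: the RegionEndpoint type, the pick_healthy_region function, the region labels, and the example.com URLs are all hypothetical, and the health check is passed in as a plain function so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class RegionEndpoint:
    """One replicated deployment of a system (hypothetical names and URLs)."""
    name: str
    url: str


def pick_healthy_region(
    endpoints: Sequence[RegionEndpoint],
    is_healthy: Callable[[RegionEndpoint], bool],
) -> Optional[RegionEndpoint]:
    """Return the first endpoint, in priority order, that passes its health check."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    # No region is available: this is the point where manual workarounds apply.
    return None


# Priority order: primary region first, replica in a second region next.
REGIONS = [
    RegionEndpoint("primary", "https://primary.example.com/health"),
    RegionEndpoint("secondary", "https://secondary.example.com/health"),
]

# Stubbed health check that simulates an outage in the primary region.
simulated_outage = {"primary"}
chosen = pick_healthy_region(REGIONS, lambda e: e.name not in simulated_outage)
print(chosen.name if chosen else "no healthy region")
```

In a real deployment the health check would call the provider or a monitoring service, and the "no healthy region" case is where the manual workarounds and alternate procedures take over.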

It is equally important to establish and integrate manual workarounds. Considering the history of cloud service disruptions, manual workarounds for cloud-hosted systems are rarely used, but absolutely critical. While having detailed manual workarounds for every cloud-hosted system in use is ideal, we understand that it may not be practical. So, we recommend that you focus on documenting workarounds and alternate procedures for the applications and systems that are needed to support the continuous delivery of your in-scope, critical products/services. Furthermore, established manual workarounds for critical systems should be tested and socialized throughout the organization, making the switch from the application to the manual procedure as seamless as possible during a third-party disruption. Of note, most CSP outages have lasted less than 24 hours, so a short-term workaround can keep operations afloat during the downtime. Alternatively, identifying backup systems or applications that have the same or similar capabilities to your critical systems can aid in continuity efforts during a longer-term disruption.

The bottom line is that contingency planning for cloud-hosted systems may seem tedious, but it’s integral to an organization’s ability to recover. Focused efforts can ease the burden when performing this planning. Business continuity isn’t about building a workaround for every application, but instead about identifying what is truly needed to continue operations and developing contingencies for those tools. From there, awareness plays an important role – awareness of the capabilities and SLA in place with your chosen CSP, awareness of the workarounds and alternate procedures viable in the event of a disruption, and awareness that just because your organization has chosen to outsource the service to a CSP, the risk has not been outsourced.

The growing diversity of IT architecture has simultaneously created more resilient DR capabilities and more risk when it comes to integrating and protecting infrastructure. Each solution – internal, SaaS, CSP – is a solid option on its own, and may be even better when skillfully integrated with other solutions. However, when an organization elects to diversify and use multiple solutions, integration is complex and must be addressed both in DR planning and in the inherent downtime planning.

Business continuity and IT disaster recovery planning is all that we do. If you’re looking for help with building or improving your business continuity program, we can help.

Please contact us today to get started. We look forward to hearing from you!


Rose Reilly
Avalution Consulting: Business Continuity Consulting