May 5, 2020

The case for testing your Disaster Recovery System

Written by Tyler Moore

Disaster recovery infrastructure, software and services are what organizations rely on when everything else around them fails. Right now, most organizations are not making sure these important capabilities will consistently work when needed.


Cloud-based DR has grown in popularity over the last few years as companies have looked for lower-cost alternatives to running a conventional DR setup at a remote datacenter or colocation center housing replacement systems. Most organizations have a disaster recovery solution in place to provide protection in the event of a disaster. However, a large portion of them are not testing to ensure that these capabilities will actually work when they are needed. With the rapid changes being made in production environments, this lack of testing could cause failures when a disaster recovery failover operation is triggered. Disaster recovery testing continues to be a major challenge for organizations, and it’s an area where vendors and service providers are adding automation and proactive management capabilities to accelerate and simplify testing so that recovery operations run smoothly when the time arrives.

The Problem: Most organizations are not testing DR enough

There are many studies on disaster recovery, and what they all make clear is that most organizations test only once a year or not at all.

Several surveys found that organizations describing their infrastructure management as mostly automated with manual exception handling were more likely to test twice a year or more than organizations that described it as manual with limited automation tools.

Regulated industries such as financial services and healthcare tend to test more than others, with 53% of financial services organizations testing at least twice a year and 35% of healthcare organizations doing the same. Similarly, larger companies are testing far more than their smaller counterparts. Nearly half of companies with over 10,000 employees test their DR at least twice a year, in contrast to just 30% of companies with fewer than 250 employees.

End user perspective and behavior

The lack of DR plan testing reflects a common sentiment that the testing process continues to be extremely time-consuming and manual. That is why this task often gets relegated to the back burner, especially if infrastructure professionals are already struggling to keep up with current business requests and requirements.

DR is not just about technology and automation. It is also about human process, and you cannot discount the importance of internal knowledge of business processes and the experience of staff members managing the workloads. It is a key requirement that IT organizations invest time in updating documentation so that teams will be able to run the DR plan efficiently even if the staff who built the plan have left the company or moved to a different role. Lack of internal documentation is a key problem, and many vendors have created software and services to help customers create their runbooks and other key assets to fulfill their compliance requirements.

Cost is often brought up as a reason for not deploying a comprehensive DR plan to cover all workloads, but more typically DR is simply not top of mind for organizations.

RStor’s Recommendations:

Integrate DR planning into automation initiatives. 

Organizations that have invested in automation are able to test their DR plans more consistently, which should make their implementations more reliable in the event of a disaster. The testing process for DR continues to be arduous for most organizations, which explains why so many are not testing enough or at all.

Cloud-based DR has an elastic resource benefit. 

One of the key benefits of a cloud-based DR implementation, in contrast to running DR at a secondary site or a colocation site, is the elasticity of the cloud. In a cloud-based DR deployment, the bulk of the resource consumption for compute, storage and networking services does not occur until a failover happens. In contrast, traditional DR environments kept matching or similar systems at the failover site, which was a major expense.

Keep staff availability in mind. 

In the event of a major disaster such as a hurricane or earthquake, it might not be possible for staffers to travel to a secondary site to manage the failover process and those resources. Remote management and security will clearly be essential in a DR scenario, but this is also an area where a cloud solution or a service provider could be valuable in managing the process and ensuring that the replacement resources are running smoothly.
