The Importance of Conducting Regular Disaster Recovery Tests
How often do you conduct a disaster recovery test? I’ve asked this question countless times and I never cease to be surprised by the responses. Many people give me an uncomfortable smile before responding, never a good sign. A recent Spiceworks study revealed that although 95% of organizations have a disaster recovery plan, almost a quarter of them (23%) never test the plan. I must imagine that a large proportion of those that have a plan are not overly confident in it or in the people designated to perform it in the event of a test or a real-life disaster.
The failure to test disaster recovery plans is often blamed on insufficient personnel or time, other more pressing priorities, or complexities associated with the recovery. With teams stretched across multiple disciplines and platforms, along with an increasingly diverse framework that often encompasses the cloud, these challenges are only likely to increase.
When you consider some of the potential repercussions of having to invoke disaster recovery, the excuses for not conducting regular tests pale into insignificance. According to Gartner, the average cost of IT downtime is estimated at $300,000 per hour, depending on the business and how it operates. The inability to service customers or trade can have both an immediate and a long-term financial impact. Coupled with modern-age dynamics in which bad news travels extremely fast, this means business reputation can suffer very quickly and take many years to fully recover.
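To put that hourly figure into perspective, here is a minimal sketch of the arithmetic. It assumes the $300,000 average quoted above and a simple linear cost model; the function name and the sample outage lengths are purely illustrative, and any real estimate should use your own business's figures.

```python
# Assumption: the average hourly figure quoted in this article.
# Substitute your own organization's estimate.
AVERAGE_COST_PER_HOUR = 300_000


def estimated_downtime_cost(outage_hours: float,
                            cost_per_hour: float = AVERAGE_COST_PER_HOUR) -> float:
    """Return a rough direct cost for an outage, assuming cost grows linearly."""
    if outage_hours < 0:
        raise ValueError("outage_hours must be non-negative")
    return outage_hours * cost_per_hour


# Illustrative outage lengths: half an hour, half a working day, a full day.
for hours in (0.5, 4, 24):
    print(f"{hours} h outage: ${estimated_downtime_cost(hours):,.0f}")
```

This only covers the immediate cost. It does not attempt to model the longer-term reputational damage described above.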
At some stage the concept of disaster recovery has been discussed and sold to the business, whether in your time or that of one of your predecessors, and the business quite rightly now assumes that in the event of a disaster normal service can be resumed within an acceptable period, back to an agreed point in time. These expectations are defined by the RPO (Recovery Point Objective) and the RTO (Recovery Time Objective). The Recovery Point Objective is defined by business continuity planning and is the maximum targeted period in which data might be lost from an IT service or application due to a major incident. The Recovery Time Objective is the targeted duration of time within which a business process must be restored after a disruption or disaster to avoid unwelcome consequences associated with a break in business continuity. The RPO and RTO are critical inputs into any solution design when architecting a disaster recovery plan, and without them plans are likely to be less successful. Equally important are documented Service Level Agreements and an understanding of your infrastructure, your application dependencies and the flow of data in and out of your applications.
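One way to make these two objectives concrete is to compare them against the timestamps captured during a test. The sketch below is only an illustration: the class names, field names and sample times are hypothetical, and it assumes that data loss is measured from the newest recoverable data to the incident, and downtime from the incident to the restoration of service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


@dataclass
class RecoveryResult:
    incident_at: datetime          # when the (simulated) disaster occurred
    last_recoverable_at: datetime  # timestamp of the newest data restored
    service_restored_at: datetime  # when the service was usable again


def evaluate(objectives: RecoveryObjectives, result: RecoveryResult) -> dict:
    """Compare measured data loss and downtime against the agreed objectives."""
    data_loss = result.incident_at - result.last_recoverable_at
    downtime = result.service_restored_at - result.incident_at
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Hypothetical figures: a 15-minute RPO and a 4-hour RTO.
objectives = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=4))
result = RecoveryResult(
    incident_at=datetime(2024, 1, 1, 9, 0),
    last_recoverable_at=datetime(2024, 1, 1, 8, 50),
    service_restored_at=datetime(2024, 1, 1, 14, 30),
)
report = evaluate(objectives, result)
print(report)
```

In this made-up example the recovery loses ten minutes of data, which is inside the RPO, but takes five and a half hours, which misses the RTO. That is exactly the kind of gap a test is meant to expose.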
Full testing of the disaster recovery plan should form part of the annual calendar and, while there is no hard and fast rule on the exact cadence, it is widely accepted that plans should be fully tested at least once, and ideally twice, a year. For those businesses that must comply with strict regulations such as PCI DSS, HIPAA or GDPR, or those that look to ISO 22301 for their business continuity framework, it may be necessary to test more regularly, including after events such as major changes to applications or key personnel.
You will not be successful in recovering your applications without testing. Regular in-depth managed tests are the only method of unearthing and resolving issues within your disaster recovery plan.
The actual testing of the disaster recovery plan is just one pillar alongside impact analysis, solution design, implementation and maintenance, as defined by the Business Continuity Management (BCM) lifecycle model. Each pillar is equally important to the ongoing success of the plan and the ability to maintain business continuity with minimal disruption.
How do you know what you don’t know? Or to put it another way, how effective is your disaster recovery plan? While it may look watertight on paper, it’s not until it’s tested that you will discover how effective it really is. There are two main elements to a plan: people and your application environment, including its many interdependencies. Neither is static, and both are prone to regular change. Both will have an impact on the time taken and the ability to recover the application and to service the business. When you conduct a disaster recovery test you are testing a technician’s ability to locate, understand, communicate, collaborate on and follow the plan as much as you are testing the plan itself. The human element should not be underestimated and can often be considered the weakest link of the plan. If any of these elements goes awry, problems with recovery will surface.
It’s key that tests are conducted properly, with no access back to production environments during the exercise to retrieve things that may have been missed. This access would not be possible in the event of a real-life disaster and so shouldn’t be allowed during a test. It should be as real a simulation as possible.
It’s worth remembering that the main reason for performing disaster recovery tests is to unearth anomalies in your plan. For example, your plan may be finely tuned and easy to follow, and you might be able to recover all applications in line with expected recovery times, but if your testing unearths delays in requesting and receiving third-party product licensing keys for your temporary environment, then you’ve not fully recovered the application.
Regular testing will result in fewer and fewer issues being discovered, although the business won’t really appreciate this until a real disaster strikes and the plan is followed. By increasing the cadence of disaster recovery tests you instil a new mindset, making them part and parcel of the standard set of tasks that require completing on a regular basis, which in turn will reduce panic and uncertainty in the event of any real invocation.
The results of the test should be clearly documented throughout. This document can take many forms but should indicate areas of success, where processes were unclear or incorrect, and where improvements in both clarity and time can be made. Any exceptions, meaning those things that you’re unable to execute as part of the test, should be documented too. This document should be reviewed with stakeholders throughout the business as part of your continuous improvement cycle. As such, disaster recovery plans should be viewed as working documents that improve as they become more familiar over time.
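As noted, the results document can take many forms. Purely as an illustration, the sketch below records each step of a test against one of the outcome types mentioned above (success, unclear or incorrect process, slower than expected, or an exception that could not be executed) and lists what needs following up. The class names, outcome labels and sample findings are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    UNCLEAR_STEP = "process unclear or incorrect"
    TOO_SLOW = "slower than expected"
    EXCEPTION = "could not be executed during the test"


@dataclass
class Finding:
    step: str
    outcome: Outcome
    note: str = ""


@dataclass
class DrTestReport:
    test_name: str
    findings: list = field(default_factory=list)

    def record(self, step: str, outcome: Outcome, note: str = "") -> None:
        """Log the outcome of one step as the test proceeds."""
        self.findings.append(Finding(step, outcome, note))

    def follow_ups(self) -> list:
        """Return every finding that was not a clean success."""
        return [f for f in self.findings if f.outcome is not Outcome.SUCCESS]


# Hypothetical findings from a single test run.
dr_report = DrTestReport("Annual full disaster recovery test")
dr_report.record("Restore application database", Outcome.SUCCESS)
dr_report.record("Obtain third-party licensing keys", Outcome.TOO_SLOW,
                 "Keys for the temporary environment arrived late")
dr_report.record("Fail over external interface", Outcome.EXCEPTION,
                 "Not executed; outside the scope of this test")

for finding in dr_report.follow_ups():
    print(f"{finding.step}: {finding.outcome.value} ({finding.note})")
```

Keeping the findings in a structured form like this makes it easier to review them with stakeholders and to compare one test against the next.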
It's worth remembering that the core reason for having a disaster recovery plan, and for testing it regularly, is to arm you and your team with the confidence that digital continuity can be provided, minimizing disruption to the business. Regular testing proves the plan and ensures that agreed recovery points and times continue to be achievable.
THREE KEY TAKEAWAYS
Ensure that your RPO and RTO are clearly defined and that your disaster recovery plan can deliver to them.
Don’t underestimate the impact of human error. Ensure all documented steps are clear and can be performed by anybody at any time, without ambiguity.
Don’t skip any step in the BCM lifecycle as this will have an impact on the success of your recovery.
This article was written by Ash Giddings, Product Manager at Maxava and an IBM Champion.
Maxava is a global provider of high availability and disaster recovery software for IBM i along with an innovative multi-platform cloud-based monitoring solution.