
IBM CA/DR Solutions on Z Combat System Availability Threats

For a growing number of businesses, five nines is not enough availability. Operational and regulatory requirements are forcing a shift from high availability (HA) to continuous availability (CA), and cyberthreats add to the pressure to protect and quickly restore critical data. To maintain operations in the face of multiple challenges, organizations must understand and manage the complexities of CA and disaster recovery (DR). IBM offers a comprehensive and customizable set of solutions to help clients keep their IBM Z* mainframe operations running.
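To put the nines in perspective, a back-of-the-envelope calculation converts each availability level into the downtime it permits over a year:

```python
# Yearly downtime budget at each availability level (365.25-day year)
MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label:>11} ({availability:.5f}) allows {downtime:6.2f} min/year")

# three nines (0.99900) allows 525.96 min/year
#  four nines (0.99990) allows  52.60 min/year
#  five nines (0.99999) allows   5.26 min/year
```

Even at five nines, a business gets barely five minutes of downtime a year; CA, masking both planned and unplanned outages, aims to eliminate even that.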

Simple But Not Easy

The concepts of CA and DR are easy to explain but complicated to implement at the scale of enterprise business. David Petersen, IBM Distinguished Engineer and chief architect for GDPS, explains CA in an elegant way. “CA has two parts. There’s HA to mask unplanned outages. And there’s continuous operations (CO) to mask planned outages.” When HA is no longer available, Petersen explains, we turn to DR.

Regulatory pressures from governments, operational pressures from stakeholders and pressure from customers are creating a business climate with little tolerance for less than 100% uptime. Additionally, social media spreads news of outages, so awareness of IT incidents is no longer limited to the customers directly affected, and that broader awareness is harmful to company reputations. These pressures are compounded by the realities that maintenance windows are shrinking and security threats from inside and outside the company are increasing, while IT budgets remain tight and the speed of business continues to rise. In the face of ever-changing applications, downtime is an expensive option.

CA/DR, especially in multiserver environments, calls for very sophisticated solutions. Furthermore, these solutions need to be as automated as possible; requiring human intervention in a DR process delays assembling the recovery experts needed to restore systems and applications, which can jeopardize service-level agreements. Human involvement also introduces the risk of error, and that risk is compounded by the stress of dealing with an outage.

The Need for Sophisticated Solutions

Nick Clayton, IBM Distinguished Engineer, Enterprise Storage Development, sees three major trends that are driving the need for ever-increasing sophistication in CA/DR solutions:
  1. The time allowed to recover from outages is shrinking, and businesses are increasingly expected to focus on CA. Keeping services continuously running can be more challenging than tolerating outages and recovering from them.
  2. It’s no longer just banks and other financial institutions that need high uptime. All industry segments are seeking to improve CA/DR because society has become more dependent upon IT and its tolerance for outages is decreasing. Additionally, solutions that were once high-end are becoming mainstream. For less sophisticated users, these solutions need to hide their complex inner workings; they need to be easier to use and more automated.
  3. The focus on being able to recover quickly from logical data corruption and destruction has increased due to growing concerns about ransomware and other cyberattacks, together with the fear of insiders destroying sensitive data. Recovering from data corruption events is complicated because the recovery process involves reassembling a corrupted workflow (applications, storage, their individual states and their interactions), potentially across multiple servers.
In light of the growing number of cyberattacks and attack vectors, we’re reminded that data protection involves more than keeping a golden copy of data along with multiple geographically dispersed copies. A growing aspect of CA/DR entails taking multiple, potentially frequent, snapshots of live data, together with the tools and expertise to identify data breaches, determine when the data was logically corrupted or destroyed, and restore the most recent clean copy. Traditional models of CA/DR that didn’t give sufficient attention to logical corruption need to be revisited. That’s the lesson of cyberresiliency.
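A minimal sketch of that last step, choosing the most recent clean copy once the corruption window has been estimated, might look like the following. The snapshot fields and the validate hook are hypothetical stand-ins for site-specific integrity tooling:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Snapshot:
    snapshot_id: str
    taken_at: datetime

def latest_clean_snapshot(snapshots, corrupted_at, validate):
    """Pick the newest snapshot taken before the estimated corruption
    time that also passes an integrity check (checksums, anomaly scan,
    or whatever the site's forensics tooling provides)."""
    candidates = sorted((s for s in snapshots if s.taken_at < corrupted_at),
                        key=lambda s: s.taken_at, reverse=True)
    for snap in candidates:
        if validate(snap):
            return snap  # most recent copy known to be clean
    return None          # nothing clean survives: escalate to forensics

# Example: snapshots every four hours, corruption estimated at 02:30
snaps = [Snapshot(f"SNAP{h:02d}", datetime(2019, 6, 1, h)) for h in (0, 4, 8)]
best = latest_clean_snapshot(snaps, datetime(2019, 6, 1, 2, 30),
                             validate=lambda s: True)
print(best.snapshot_id)  # SNAP00: the 00:00 copy predates the corruption
```

The hard part in practice is the validate step; detecting when and where corruption began is exactly the expertise Clayton describes.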

IBM GDPS

The drive to never go down is increasingly not satisfied by the traditional model of a single production site and a single standby DR site. Glenn Wilcock, senior technical staff member, IBM Systems, explains that businesses need to understand important concepts such as recovery point objective (RPO), recovery time objective (RTO), backup window objective and network recovery objective, as well as the RPO and RTO typically achievable with a number of DR options. All organizations with mission-critical IT, especially those under regulatory pressure, need to ask themselves how long it would take to recover their data and how much data they can afford to lose. Beyond DR, organizations must consider how much time they have for planned maintenance outages. For a growing number of organizations, a basic two-site DR configuration with limited CO options exposes them to more risk than they can manage.

IBM GDPS is a suite of solutions that automates the CA/DR process and scales to three- and four-site configurations (ibm.co/2IqChZ8). GDPS manages the complexity of providing CA with options for synchronous and asynchronous remote copy (depending on the distance between sites) to meet RPO and RTO business objectives. While having dispersed sites provides recoverability in the event of a site outage, it also introduces data transfer latency, along with the risk that a small amount of data will be lost during a failover across sites.

Petersen recommends that enterprises refer to the IBM Redbooks* publication “IBM GDPS Family: An Introduction to Concepts and Capabilities” for a detailed overview of GDPS (ibm.co/2N7jWiA). The publication introduces IBM data replication technologies along with the different GDPS offerings. Chapter 1 on business resilience is a worthwhile read for those new to the CA/DR conversation, particularly Section 1.3 on IT resilience. In that section, Figure 1 lists typically achievable RPO and RTO for six common DR options. Understanding that table and the related discussion will go a long way toward helping organizations comprehend the challenges and solutions surrounding CA/DR.
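The distance trade-off comes down to physics. As a rough illustration, assuming light travels at about 200,000 km/s in optical fiber (a common rule of thumb; real links add protocol and switching overhead), every synchronous mirrored write pays a round trip between sites:

```python
# Back-of-the-envelope propagation delay for synchronous remote copy
SPEED_IN_FIBER_KM_PER_MS = 200.0  # roughly two-thirds the speed of light

def round_trip_ms(distance_km):
    """Round-trip delay added to every synchronously mirrored write."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS

for km in (10, 100, 1000):
    print(f"{km:>5} km apart -> +{round_trip_ms(km):.1f} ms per write")

#    10 km apart -> +0.1 ms per write
#   100 km apart -> +1.0 ms per write
#  1000 km apart -> +10.0 ms per write
```

This is why synchronous remote copy, which can deliver an RPO of zero, is generally confined to metropolitan distances, while longer distances call for asynchronous copy and acceptance of a small, nonzero RPO.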

IBM Parallel Sysplex

The IBM Redbooks publication also introduces IBM Parallel Sysplex clustering technology as the foundation upon which GDPS is built (ibm.co/2IgQzvF). David Surman, IBM Distinguished Engineer and the design and development lead for the Parallel Sysplex components in z/OS*, explains, “A Parallel Sysplex is a shared-data cluster of z/OS systems intended to provide HA and scalability for client workloads. It does this by providing redundant resources (servers, systems, middleware subsystems, application instances, network connections, etc.) along with workload routing and data sharing capabilities so that the client’s workload can run on any z/OS image in the sysplex.” Surman further notes that Parallel Sysplex provides seamless scaling and availability for workloads through planned and unplanned outages, and also in the face of capacity constraints.

Mark Brooks, senior programmer, IBM Parallel Sysplex, adds to Surman’s introduction. “Achieving CA requires more than a redundant infrastructure,” he says. “There is also work to do in terms of how you operate things.” Clients need to perform rolling initial program loads (IPLs) using a strategy that upgrades one component at a time (whether hardware, application software, middleware or OSes), yielding a mix of old and new components that can coexist for a time. “In general, the IBM servers, z/OS and IBM middleware all provide a migration pathway wherein, so long as you’re not too far behind (usually N-2), you will be able to make these rolling upgrades with no interruption to the ongoing workload,” he adds.

The GDPS CA (formerly GDPS/Active-Active) solution maintains CO across sites separated by arbitrary distances. In this solution, a Parallel Sysplex can fail over to another Parallel Sysplex to provide near-continuous availability in the event of a site failure. Petersen notes that GDPS CA uses software replication and achieves an RTO of 30 seconds or less. He contrasts this with the larger RTO of an hour or less that z/OS Global Mirror (formerly Extended Remote Copy) and IBM Global Mirror achieve with GDPS using disk-based replication.
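To make the rolling-upgrade discipline concrete, here is a minimal sketch of the version-skew check Brooks describes. The member names, version levels and the commented drain/resume steps are hypothetical stand-ins for routing workload away from a system and IPLing it at the new level:

```python
MAX_SKEW = 2  # "N-2": the oldest member may trail the newest by two levels

def can_coexist(versions):
    return max(versions) - min(versions) <= MAX_SKEW

def rolling_upgrade(members, target):
    """Upgrade one member at a time, refusing any step that would
    leave the cluster outside the supported N-2 window."""
    for name in list(members):
        if members[name] >= target:
            continue
        trial = {**members, name: target}
        if not can_coexist(trial.values()):
            raise RuntimeError(f"upgrading {name} would exceed N-2 skew; "
                               "bring older members forward first")
        # drain_workload(name)  # route work to the remaining members
        members[name] = target  # stand-in for an IPL at the new level
        # resume_workload(name)
        print(f"{name} upgraded; cluster levels: {members}")

rolling_upgrade({"SYSA": 2, "SYSB": 2, "SYSC": 2}, target=3)
```

The same one-at-a-time pattern applies whether the component being rolled is hardware, middleware or the OS; the N-2 window is what lets old and new levels coexist while the workload keeps running.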

Simplification Is Key

While some clients require the sophisticated capabilities of the various GDPS offerings, others don’t, provided they can accept substantially higher RPO and RTO, on the order of 48-72 hours for a tape restore. Wilcock explains several simpler DR options. If CA isn’t a requirement, a DR plan can begin with the basic solution of taking a full dump of the environment as a point-in-time flash copy and saving the data offsite. That solution can be refined using DFSMShsm, which provides automation for backup, recovery, migration, recall, DR and space management functions in DFSMS, the storage management component of z/OS. DFSMShsm uses Aggregate Backup and Recovery Support, which identifies the data associated with an application and makes it possible to prioritize the recovery of individual critical applications (a sketch of the idea appears at the end of this article).

While IBM solutions for CA and DR manage a tremendous amount of complexity associated with keeping critical infrastructure running, planning for and implementing these solutions requires significant skill and experience. IBM offers services that can help manage the complexity. IBM Resiliency Orchestration helps simplify DR management to reduce risk and improve availability; the service can also help with cyber incident recovery. IBM Business Continuity Consulting Services, a complementary offering, helps clients plan for potential continuity incidents. Together or separately, the two services can improve RPO and RTO while simplifying and accelerating DR processes.

Whatever approach IT uses to ensure maximum uptime, the trend is clear: CA/DR solutions are becoming more sophisticated and more comprehensive to deal with increasing threats to data and to system availability. These solutions are also evolving to abstract away much of the complexity of their implementations. Today’s solutions offer more choices, increased capabilities, greater sophistication and more automation. They’re also easier and quicker to deploy and manage. Together, these are good things for the bottom line.
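As a closing illustration of the aggregate idea, here is a minimal sketch of priority-ordered recovery in the spirit of what Aggregate Backup and Recovery Support enables. The aggregate names, priorities and the restore step are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Aggregate:
    name: str
    priority: int  # 1 = most critical application
    datasets: list = field(default_factory=list)

def restore_dataset(ds):
    print(f"  restoring {ds}")  # stand-in for the actual recovery action

def recover(aggregates):
    """Bring applications back in priority order, most critical first."""
    for agg in sorted(aggregates, key=lambda a: a.priority):
        for ds in agg.datasets:
            restore_dataset(ds)
        print(f"recovered {agg.name} (priority {agg.priority})")

recover([
    Aggregate("BATCH.REPORTS", priority=3, datasets=["RPT.MASTER"]),
    Aggregate("PAYMENTS.CORE", priority=1, datasets=["PAY.DB", "PAY.LOGS"]),
    Aggregate("CRM.ONLINE", priority=2, datasets=["CRM.DB"]),
])
```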