Overcome HA/DR Challenges on IBM Power Systems
IBMers Steve Finnes and Steve Sibley explain the value of designing and testing an HA/DR strategy.
Image by Romain Trystram
Do you know how much an unexpected outage of one of your mission-critical servers would cost your company? If you aren’t able to give an exact hourly rate, don’t feel too bad. In an independent survey published by Information Technology Intelligence Consulting (ITIC) last March, 55% of respondents said that they couldn’t either (bit.ly/2pt4o1f). When queried further, only one in five respondents was actually able to “accurately assess the hourly cost of downtime and its impact on productivity and the business’ bottom line.”
Putting a Price on Downtime
That’s not surprising: the cost of downtime is difficult to calculate. If a manufacturing plant halts production because a key server fails, calculating the loss of productivity could be straightforward. But if a bank’s customers are unable to transact when they want to, what price do you put on brand and loyalty?
“People often haven’t done that exercise,” says Steven Finnes, IBM Power Systems high availability (HA) offering manager. “They look at the investment cost to have HA and disaster recovery (DR) solutions, but they don’t look at what it would cost to be down for an hour or more at peak production time.”
The chances are that the costs are higher than you think. Of those in the ITIC survey who could calculate the cost of downtime per hour, a quarter said it was between $301,000 and $400,000. One in three estimated the cost at $3-5 million.
Planning for Problems
The cost of downtime doesn’t come only from unplanned outages, adds Finnes. HA also means preparing for the planned disruptions necessary to keep an IT infrastructure current and able to leverage new capabilities.
“If you don’t have a modern HA solution, how much does it cost the business to take the production servers offline to do hardware and software maintenance? And what’s the cost from a human resource point of view?” he asks.
According to ITIC, almost all service-level agreements today demand “four nines” of availability (i.e., 99.99%) or more for servers within the data center. Meeting that target makes HA a combination of choosing the most resilient hardware and designing data center operations for continuous availability, eliminating (or at least minimizing) the impact of both planned and unplanned outages.
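To make “four nines” concrete, the short Python sketch below converts an availability percentage into the downtime it permits per year and prices that downtime at an hourly rate. The function names are illustrative, and the $301,000 hourly figure is simply the low end of the range quoted from the ITIC survey above, used here as an assumption rather than a benchmark for any particular business.

```python
# Illustrative sketch: translate an availability target into allowed downtime
# and an approximate annual cost. Names and the hourly rate are assumptions.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def annual_downtime_cost(availability_pct: float, hourly_cost: float) -> float:
    """Cost of using up the full downtime allowance at a flat hourly rate."""
    return allowed_downtime_minutes(availability_pct) / 60 * hourly_cost


HOURLY_COST = 301_000  # assumed: low end of the survey range cited in the article

for pct in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    cost = annual_downtime_cost(pct, HOURLY_COST)
    print(f"{pct}% -> {minutes:.2f} min/year, about ${cost:,.0f}")
```

At 99.99%, the allowance works out to roughly 52.6 minutes a year, which a single extended maintenance window could consume on its own. The flat hourly rate is a simplification; an outage at peak production time would likely cost more than one overnight.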
DR, on the other hand, is the ability to restore service when a critical failure or natural disaster impacts access to the data center itself. It’s one of the fastest growing sectors of IT spending and is increasingly popular as a cloud service add-on.
“What I’m hearing from clients is that workloads have become 24-7, 365 days of the year. No business can afford for applications to go down. They need to be available at all times, both for customers and for internal staff and processes,” says Steve Sibley, vice president of offering management, IBM Cognitive Systems Infrastructure.
There’s still some education to do, Sibley continues, but as IT has become more central to all aspects of business in all sectors, handling everything from supply chain management to medical records, the need for solutions that deliver continuous availability has become evident. Pockets of resistance exist where HA/DR solutions are seen as grudge purchases on top of new solutions and corners are cut—but they are becoming fewer and farther between.
“The biggest impact on availability awareness is when an outage occurs and a system is down for a day,” says Finnes. “There’s nothing like that type of reality check for people to realize that what they thought was an HA solution is not.” That even goes for Power Systems clients. The focus on designing the Power Systems platform to be the most reliable and resilient in the world is “maniacal,” says Sibley, and the hardware is famously robust. But no system is immune to failure, and in ITIC’s survey, 59% of respondents reported that human error had the greatest negative impact on server reliability.
Sibley tells the story of one client whose IBM Power Systems server ran for nine years without a single issue—but when something did eventually go wrong, it created chaos because the company was completely unprepared. “There’s always a chance that a system will go down,” Sibley says. “Software can fail, or a disaster might hit the data center. Organizations need to plan for application and system availability even when these failures occur.”
Balancing Data Loss and Expense
Strategies and solutions that deliver continuous availability are evolving. Gartner recently predicted that 40% of businesses will replace their current backup applications by 2022, as more complex environments with big data capabilities become more common. Even with some of the most resilient systems in the industry, IBM is delivering new solutions to simplify HA/DR and address these new challenges.
The most comprehensive solutions see data and infrastructure mirrored in multiple data centers and “zones of availability.” At a minimum, this means ensuring that the sites don’t share a single power grid, but it can also mean thinking about geography so that two data centers cannot be affected by a single natural disaster.
“Many businesses need to have a recovery point with zero data loss,” says Finnes. “But you can’t replicate data synchronously to the application state over a great distance. Therefore, you need to consider a solution that enables both a recovery point of zero and also enables geographic dispersion.”
Given the expense, pragmatism is needed, along with a careful risk evaluation of individual systems and parts of a system. Understanding the impact of different potential points of failure will help managers decide what level of protection they need.
“We do have companies that accept that they can lose a few seconds of data and settle for a two-site solution,” Finnes says. “A bank, on the other hand, might have three sites. Two on the same campus that are synchronously connected and a third many miles away.”
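One reason synchronous replication tends to stay on campus is simple physics. With synchronous replication, a write isn’t acknowledged until the remote copy confirms it, so every write waits at least one network round trip. The Python sketch below estimates that floor from distance alone, assuming light travels through optical fiber at roughly 200,000 km per second. The function name and the sample distances are illustrative, and real links add routing, switching and storage delays on top of this lower bound.

```python
# Illustrative lower bound on the delay synchronous replication adds per write.
# Assumes signal speed in optical fiber of ~200,000 km/s (about 200 km per ms).

FIBER_KM_PER_MS = 200.0


def sync_write_penalty_ms(distance_km: float, round_trips: int = 1) -> float:
    """Minimum added latency per write, in milliseconds.

    Each round trip covers the distance twice: once to send the data to the
    remote site and once to return the acknowledgment.
    """
    return round_trips * 2 * distance_km / FIBER_KM_PER_MS


# Hypothetical site separations: same campus, metro area, distant region.
for km in (1, 50, 1000):
    print(f"{km:>5} km -> at least {sync_write_penalty_ms(km):.2f} ms per write")
```

Across a campus, the added delay is a small fraction of a millisecond. At 1,000 km it is at least 10 ms on every write before any equipment delay, which is why a distant third site is typically fed asynchronously and may trail the primary by a few seconds of data.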
Regular Testing Is Crucial
The biggest challenge, Finnes continues, isn’t convincing people that they need a DR solution and an action plan for when things go wrong. It’s testing those plans regularly and making sure that the entire failover process works. “The best practice is to execute failover operations regularly to validate that everything is working,” Finnes says. “In some cases, organizations will switch operations quarterly between their main site and their secondary site, and stay there until the next drill.”
In some industries, such as banking, regular DR testing is mandated by regulations. Too often, however, plans aren’t regularly tested, and organizations that think that they have a working strategy find that it doesn’t deliver when needed.
Keep It Simple
Sophisticated tools allow organizations to be more holistic in their approach to recovery strategies, says Sibley. But the fundamentals still need to be right. Server management tools that include HA/DR are improving all of the time. And Sibley points out that IBM has overhauled its HA offerings over the last three years: updating PowerHA high availability; introducing VM Recovery Manager, which automates restarting VMs for both HA and DR environments; and developing IBM Db2 Mirror for i. “This enables us to extend a whole new level of functionality to clients,” Sibley says.
The key to HA is taking a systematic, comprehensive approach to an organization’s IT infrastructure and then automating as much as possible, says Sibley. “Human error is often the challenge that leads to needing modern HA/DR plans,” he says. “But it can also be the problem when implementing them. The more you can automate, to minimize human error, the simpler and more robust it is.”
Adam Oxford is a freelance writer based in South Africa. He’s covered technology-related issues for more than 20 years.