Embracing Automation Enables Effective IBM i HA/DR Planning
Automation is a boon for systems administrators. They can, for example, schedule jobs to run at specific times and days, considering issues such as job types and runtimes, differing time zones, end-of-day office closings and resource consumption.
Although not entirely hands-off (admins may still have to respond to job alerts), automation has addressed some particularly acute issues, including the growing shortage of seasoned IBM i professionals and out-of-office situations, when an expert devoted to a subset of IT functions is on vacation or emergency leave.
In the latter case, somebody else can temporarily step in to oversee jobs and only interact with them when an alert is triggered—and then quickly respond thanks to easy-to-understand scheduling interfaces.
“You can have hordes of people keeping an eye on everything if you want or you can automate it and not worry if a team member is out for whatever reason. The scheduler is looking after the jobs,” remarks John Massey, managing director of RevSoft, which provides a variety of automation and management solutions. “Because of this—and seemingly unrelated issues such as the COVID pandemic, which pushed people out of the office—an increasing number of companies, including smaller ones, have become attuned to the benefits of automation.”
Unnecessary Delays
Indeed, companies that have embraced job automation and management can now do more with less without hampering business-critical operations. If a job fails, the consequences can spread: the counting and processing of goods sold might halt, for example, interrupting replenishment orders. The job will certainly be rerun, but there may be a gap during which the reorder process wasn’t updating, which could affect portions of the supply chain.
More catastrophically, an entire system might go down, taking all its current and upcoming jobs with it. Of course, most companies have disaster recovery (DR) plans in place, but the backup software they’re using, while replicating to the backup box, may not include automated job schedules as part of the process. Even if it does, getting them running again can be a chore. So, while the secondary system might come online quickly, it won’t immediately be running business-critical jobs.
The HA/DR engine running on many organizations’ primary systems replicates changes to the secondary system in milliseconds, whether replication is journal-based or performed at the object or database level. This allows the target system to take over in the event of a disaster or even during a planned role swap.
That target system is now the production system, but there’s one component that’s missing: the seamless replication of the job scheduler. So, the question becomes, how does one bring that job scheduler up on the proxy server with almost no delay?
As Alex Rodriguez, enterprise sales manager with RevSoft, explains, “If you’re a bank, you’re dealing with and processing teller requests every 15 minutes and also with clearinghouse transfers. But hang on. None of these jobs are running on the proxy server. What’s happening? Well, you may have replicated your scheduler using an HA tool to a different library on the destination server, but when you cut over, you have to stop your current schedules, which have locks on tables and jobs, rename the replicated library as the live library and start it up again.”
A Holistic View of IBM i HA/DR Strategies
Depending on how many schedules are involved, this might take hours to complete, and most companies, especially those with high transaction rates, literally can’t afford to lose that amount of time. Tellers are still working and clearinghouse transfers are still taking place. As a result, scheduled jobs have to resume as quickly as possible, almost as if there hadn’t been a switchover in the first place.
That’s why companies need to take a holistic view of their IBM i HA/DR strategies. Delayed restores cost money, so the quicker they can get up and running, the better for the bottom line. In addition to replicating the data on their production servers to backup boxes, they also have to consider their automated job schedules.
“Think of that ship that went sideways in the Suez Canal. It stopped every boat behind it until it was cleared,” Massey says. “How much was this costing every company every second while they were stuck there waiting? How many pounds, how many dollars, how many rubles, how many euros? In a DR scenario—if a system is down for an hour or so—how many orders are being lost? What’s that costing you?”
The ‘Shadowing’ Feature
That’s why it’s important to leverage a feature RevSoft calls “shadowing,” which is available as part of RevSoft’s RevScheduler. Shadowing replicates an “environment” of schedules to the backup system, but in a dormant state so they’re not executed. They’re only activated upon the execution of the RJACTHAV command.
As Massey explains, “Think of an environment as a sports team made up of all the players you have on your roster at the moment. So, with shadowing, I can say move my team over there—like it’s going to an away game—and all the players go with it. In the case of RevScheduler, I can point an environment of 700 schedules to the secondary system without having to worry about each individual job. In fact, anytime you make a change to a job, it’s automatically synced and applied to the other box using shadowing and not via the system replication of HA tools.”
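The idea of a dormant, always-synced copy that is activated as a unit can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class names, the `active` flag, and the sample job names are hypothetical and do not reflect how RevScheduler or the RJACTHAV command is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Schedule:
    name: str
    command: str
    interval_minutes: int


@dataclass
class Environment:
    """A named group of schedules that is moved and activated as one unit."""
    name: str
    active: bool
    schedules: dict[str, Schedule] = field(default_factory=dict)


class ShadowedScheduler:
    """Hypothetical model: one live environment plus a dormant shadow copy."""

    def __init__(self, env_name: str) -> None:
        self.primary = Environment(env_name, active=True)
        self.shadow = Environment(env_name, active=False)

    def save(self, schedule: Schedule) -> None:
        # Every change is applied to both copies, so the shadow never lags.
        self.primary.schedules[schedule.name] = schedule
        self.shadow.schedules[schedule.name] = schedule

    def runnable(self, env: Environment) -> list[str]:
        # Only an active environment hands its jobs to the scheduler.
        return sorted(env.schedules) if env.active else []

    def cut_over(self) -> None:
        # Activation flips one flag per environment; no per-job work, no renames.
        self.primary.active = False
        self.shadow.active = True


# Illustrative usage with made-up job names.
scheduler = ShadowedScheduler("BANKOPS")
scheduler.save(Schedule("TELLER_POST", "CALL PGM(TELLERPOST)", 15))
scheduler.save(Schedule("CLEARING", "CALL PGM(CLEARING)", 60))

before = scheduler.runnable(scheduler.shadow)  # dormant: nothing runs
scheduler.cut_over()
after = scheduler.runnable(scheduler.shadow)   # active: every job is runnable
```

In this model the cost of a cutover does not grow with the number of schedules, because nothing is copied or renamed at switch time. That is the property the manual rename-and-restart approach lacks.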
Flipping a Switch
As part of the shadowing process, a verification tool of some sort has to be available, such as the integrity checker that was recently incorporated into RevScheduler. It helps allay any fears that the job scheduler may not have been replicated properly. If something should go awry, users are alerted so they can make corrections before a failover occurs. This type of functionality is also often available in system-wide replication software solutions.
“Nobody wants to be in the wrong place at the wrong time, so what we’re doing is giving users peace of mind,” Massey says. “So, this tool checks the integrity of jobs on both systems—primary and secondary—in real time, not five minutes ago, not five minutes after that. It’s checking them now to make sure you’re good, because you don’t want to just sit back and say, ‘Yeah, it’ll be fine.’ You want to know you’ll be fine. So, we consider the integrity checker a seat belt of sorts, keeping you safe even if a wrong-time-wrong-place scenario does happen.”
Such scenarios do happen, as an Illinois-based bank recently discovered. It had run a planned failover test to a backup server on Saturday, October 10, 2020, and then switched back to its production server on the night of Sunday, October 11. Everything worked perfectly, as planned.
But on the following Thursday morning, October 15, the production server’s disk controller unexpectedly failed. This wasn’t a test anymore, and the bank had to fail over to a backup system live. But how long would the automated scheduling cutover and cutback take?
“They came in at 0.108 of a second on the cutover execution time with 100% accuracy. We’re talking about quicker than you can type in ‘rename object’ and press F4,” Massey says. “They had 159 jobs, which isn’t as many as other companies, but in HA mode, you’re all in the same boat. It’s costing you money. So, the quicker the bank could cut over, so much the better. After the disk controller had been fixed, the cutback execution time was 0.062 of a second. Throughout all this, they didn’t have to rename or change anything. It was basically like flipping a switch.”
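A check of this kind can be sketched in a few lines of Python. The sketch below is not RevSoft's integrity checker. It is a generic comparison that assumes each system can export its job definitions as a dictionary keyed by job name, and the function names and sample jobs are hypothetical.

```python
import hashlib
import json


def fingerprint(definition: dict) -> str:
    """Stable hash of a job definition (keys sorted so ordering is irrelevant)."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_integrity(
    primary: dict[str, dict], secondary: dict[str, dict]
) -> dict[str, list[str]]:
    """Return job names that are missing, unexpected, or different on the secondary."""
    missing = sorted(set(primary) - set(secondary))
    unexpected = sorted(set(secondary) - set(primary))
    mismatched = sorted(
        name
        for name in set(primary) & set(secondary)
        if fingerprint(primary[name]) != fingerprint(secondary[name])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Illustrative data: one job drifted and one was never replicated.
primary_jobs = {
    "TELLER_POST": {"command": "CALL PGM(TELLERPOST)", "every_minutes": 15},
    "CLEARING": {"command": "CALL PGM(CLEARING)", "every_minutes": 60},
    "EOD_REPORT": {"command": "CALL PGM(EODRPT)", "at": "18:00"},
}
secondary_jobs = {
    "TELLER_POST": {"every_minutes": 15, "command": "CALL PGM(TELLERPOST)"},
    "CLEARING": {"command": "CALL PGM(CLEARING)", "every_minutes": 30},
}

report = check_integrity(primary_jobs, secondary_jobs)
in_sync = not any(report.values())
```

Running a comparison like this on demand, rather than trusting the last replication log, is what lets an administrator act on a discrepancy before a failover instead of discovering it afterward.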
Accounting for Everything
There are many reasons why IT should embrace automation, and one of the most common is simply that it makes things easier to manage. The same thinking should apply to HA/DR.
Most system-replication solutions do what they’ve been built to do—and do it well—but they may omit or downplay certain critical system functionality, such as job schedulers. Unfortunately, this can result in additional failover effort and increased failover times, both of which can potentially leave money on the table.
But with a thoughtful HA/DR plan in place, that doesn’t have to be the case. This is especially true now that automation, including automated job schedules, has become increasingly vital to business operations, no matter the size of the organization. Everything must be accounted for and become part of the larger HA/DR design, and nothing can be left to chance.
For more information, see “REV SCH HA Overview” and “REV SCH HA Integrity Checker.”