How to Automate an Entire Software System

One of the most challenging projects is automating the startup, shutdown and recovery of an entire software system using a thoughtful method.

I studied to be a computer application developer. My interests were halfway between business and computers, so developing applications was a pretty good fit. Besides, I didn’t know anything about the system side of things (OSes, middleware software, job entry subsystems and system software installation), basically the tasks and concerns of the system programmers. After five years as an application developer, I was invited to join the system programming team, where I stayed for a long time, enjoying the many challenges of the work. One of the most challenging kinds of projects was automating the startup, shutdown and recovery of an entire software system using a thoughtful method. Let me explain.

System Automation Projects  

Originally, mainframes were operated almost entirely by human operators. These console operators monitored activity at the system console in the computer room, responded to action messages (for example, by mounting tapes or supplying run-time information to applications) and handled exceptions as they arose. In the late 1980s, companies started to examine the normal operation of large-scale systems and to make changes. Part of the motivation was to change the way these systems were run so that people managed the exceptions, not the normal operation (which could be handled by software). Another motivation was to save labor costs by using fewer computer operators.

System automation projects usually started the same way: by identifying project goals, selecting tools and team members, and creating a plan. A typical early task was identifying messages to be suppressed from the console. Over the years, different log analysis tools have been supplied to help figure out which messages to suppress. The mechanism to suppress them was to put the message ID in a message processing facility (MPF) member, MPFLSTxx, in SYS1.PARMLIB and use the SET MPF=xx command to make the list active. You can imagine the before and after view of the console; it was shocking to see the difference. Operators on the team asked: how do we know the system is running if we don’t see job started and ended messages? Usually a system monitor tool was used that showed jobs in the system, CPU busy, disk I/O rates, etc. This was more useful than using messages to show activity.
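
As a rough illustration of the idea (this is not how MPF itself is implemented), suppression amounts to filtering the console stream against a list of message IDs. The Python sketch below assumes the message ID is the first token on each line; the IDs and message text are invented for the example.

```python
# Hypothetical suppression list. These message IDs are invented for
# illustration and do not correspond to real system messages.
SUPPRESSED_IDS = {"ABC100I", "ABC101I"}


def message_id(line: str) -> str:
    """Return the first whitespace-delimited token, treated as the message ID."""
    parts = line.split(maxsplit=1)
    return parts[0] if parts else ""


def filter_console(lines, suppressed=SUPPRESSED_IDS):
    """Keep only the lines whose message ID is not on the suppression list."""
    return [line for line in lines if message_id(line) not in suppressed]


# Invented sample console traffic: two routine messages and one action message.
sample = [
    "ABC100I PAYROLL STARTED",
    "ABC101I PAYROLL ENDED",
    "ABC900A PAYROLL WAITING FOR REPLY",
]
visible = filter_console(sample)
print(visible)
```

With the routine started and ended messages filtered out, only the action message is left for the operator to see, which is the before and after effect described above.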

After messages were suppressed, some programming usually needed to be performed. Automation candidates might include trapping an action message and providing an automated response, or grabbing an information message and issuing a series of required commands. This was a piecemeal way to do it: find a limited set of situations and “automate” them. It was, however, a useful way to show success. Many used the NetView product for this kind of work because it supported a CLIST language (and later Rexx), had a message table for trapping messages and taking actions, had easy-to-use timer facilities (AT, EVERY and AFTER), could be used to issue OS commands, and came with various automation samples. Today, IBM has automation products that use NetView as their automation engine.
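
The message-table idea can be sketched in a few lines of Python. This is only an illustration of the trap-and-respond pattern, not NetView syntax: the `Rule` class, the message IDs and the command strings are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One message-table entry: a pattern to trap and a response to build."""
    pattern: re.Pattern
    respond: Callable[[re.Match], List[str]]


# Hypothetical rules. Message IDs and command text are invented for
# illustration and are not real operator commands.
RULES = [
    # Trap an action message and provide an automated reply.
    Rule(re.compile(r"^ABC900A (?P<job>\S+) WAITING FOR REPLY"),
         lambda m: [f"REPLY {m['job']},CONTINUE"]),
    # Trap an information message and issue a series of commands.
    Rule(re.compile(r"^ABC200I (?P<subsys>\S+) ENDED ABNORMALLY"),
         lambda m: [f"DISPLAY {m['subsys']}", f"START {m['subsys']}"]),
]


def commands_for(message: str) -> List[str]:
    """Return the commands of the first matching rule, or an empty list."""
    for rule in RULES:
        match = rule.pattern.search(message)
        if match:
            return rule.respond(match)
    return []


print(commands_for("ABC200I DBSERV ENDED ABNORMALLY"))
```

Each rule covers exactly one situation, which is why this style of automation grows piecemeal: every new situation means another entry in the table.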

Tools Gave Way to Products  

Starting in the mid-1990s, the tools approach gave way to the use of comprehensive system automation products that took a top-to-bottom approach to the challenge of automating a complete system. These products operate in this way:
  • At the earliest possible time in the IPL, the automation software takes over the IPL process from the operator and starts all software in the order specified in a main system definition file.
  • All resource manipulation done by operators, such as stopping or restarting a subsystem, is done using the automation software console and supplied commands.
  • Automation software handles all recovery actions. Software companies supply recovery definitions to be administered by the automation team, and there is always the possibility to add your own company-specific definitions and actions.
  • Systemwide shutdown (or restart) is handled by a single command. The software performs the shutdown based on the hierarchy supplied by the automation team. For example, software that requires the network is shut down, and its shutdown completes, before the network itself is shut down.
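
The ordering behind the first and last points can be sketched with a dependency map and a topological sort. This is a minimal Python illustration, not how any particular product stores its definitions; the resource names and the `DEPENDS_ON` layout are assumptions made for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical hierarchy: each resource maps to the resources it requires.
DEPENDS_ON = {
    "network": [],
    "database": ["network"],
    "web_app": ["network", "database"],
}


def startup_order(depends_on):
    """Start each resource only after everything it requires has started."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shut down in reverse, so dependents stop before what they rely on."""
    return list(reversed(startup_order(depends_on)))


print(startup_order(DEPENDS_ON))
print(shutdown_order(DEPENDS_ON))
```

Deriving both orders from one definition is the point of the sketch: the team maintains a single hierarchy, and the network is always started first and stopped last.
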
The transition from build-your-own automation to implementing a comprehensive product has reached a high level of maturity. Today, in the discipline of IT service management, IBM offers numerous systems and workload automation products, including:
  • IBM Automation Control for z/OS
  • IBM Cloud Orchestrator
  • IBM Infrastructure Suite for z/VM and Linux
  • IBM OMEGAMON Performance Management Suite for z/OS
  • IBM Service Management Unite
  • IBM Service Management Suite for z/OS
  • IBM System Automation for z/OS
  • IBM Workload Automation
  • IBM Workload Scheduler for z/OS
  • IBM Workload Scheduler
  • Tivoli Asset Discovery for z/OS
  • Tivoli NetView for z/OS
  • Tivoli Output Manager for z/OS
  • Tivoli System Automation Application Manager
  • Tivoli System Automation for Integrated Operations Management
  • Tivoli System Automation for Multiplatforms
  • Tivoli Workload Scheduler LoadLeveler
From this list, you can see the range of OSes covered (z/OS, z/VM and Linux) and the kinds of workloads (console and batch jobs). The disciplines vary as well, for example, performance and operations management. Automation that started in large-scale systems has sprung up all over IT. Gartner writes of the role of IT process automation tools and the need for careful management.

Next Week

Next week, I will explore something else they didn’t teach me in school—how to create an offering—a combination of hardware, software and services—that can be deployed to a worldwide set of clients.