Capacity Planning and Management: Don’t Let the Tank Go Dry
IT computing complexes are fluctuating, evolving entities, characterized by continual, erratic growth intertwined with shifting business volumes. The long-term trend is continual system growth driven by business volumes, as well as new forms of usage. A daunting challenge for most IT operations is to have a “crystal ball” that predicts growth in IT resource demand, and then to match that demand with hardware, software and technological advances. Once that’s been done, a plan and schedule need to be defined and implemented, staying one step ahead of performance degradation. This is a simplified definition of capacity planning and management.
As I pondered this topic and article, I recalled a binder I still have in my basement, and I wandered down and dug it out. The large, white, capital letters “HAL” dominated the binder’s back, bringing a smile to my face, for the term hearkened back to the movie “2001: A Space Odyssey.” The onboard computer was called “HAL,” a play on “IBM” with each letter shifted back one place in the alphabet. In our case, it stood for “Hardware Addition List,” a document created and maintained by a large insurance company’s fledgling capacity management committee. It was composed of several managers and IT specialists and a couple of top-notch IBM system engineers, plus me, the rookie.
This client had just converted numerous batch systems—customer service, policy sales, marketing, etc.—to online access across the country, which greatly expanded their reach and sales. Especially in those days, providing consistent, decent response time was an intimidating challenge, and providing the necessary computing power was a game of frequent buy and replace. It was haphazard, more guesswork than organized, and the client enlisted IBM’s help to get their arms around the process. I feel fortunate to have been part of the undertaking.
While capacity planning and management is much more sophisticated these days, the fundamental functions are as germane now as they were then.
It Starts With Data and Patterns
We were initially at a loss as to how to start, but a systems programmer pointed out that the system itself could help us plan and project:
- There was a readily available source of information on past performance that could provide the detail needed to project future performance trends. That alone wouldn’t be enough, but it was a sound base on which to build. While this was in the early 1980s, MVS already collected System Management Facilities (SMF) data, and Resource Measurement Facility (RMF) data recording was either available or soon would be (I can’t recall); those were great sources of performance and capacity data.
- Financial reports revealed that some business transaction volumes ebbed and flowed with certain annual periods, such as holidays, promotions and year-end. Department business measurements were also enlightening.
- A number of IBM software products, like CICS/VS and VSAM, had begun collecting performance and usage statistics, so we collected that data, too.
- The committee created and staffed a part-time position to serve as an interface with upper management, informing the HAL committee of new products, promotions and future plans. This made it possible to merge business growth estimates into IT hardware and software growth projections, so system expansion could be planned rather than reactive.
- Another category of information we needed, which fell to us IBMers, was product performance metrics. What was the processor cycle speed? Disk/tape capacity and transfer speed? Program path length? We got the answers to those questions, and established contacts in IBM labs that specialized in capacity and throughput concerns.
- Lastly, software and hardware tuning was recognized as a vital way to increase capacity, and that’s where I came in. Tuning certainly had its limitations, but it was a great short-term fix, a technique to keep things operational until more resources could be installed. Tuning is the process of making system and application adjustments that result in optimum performance. It is also about tradeoffs, and that’s where a specialist has the greatest impact. Many products are guided by parameters that control real storage, virtual storage, processor cycles, data transfer speeds, data placement on disks and network bandwidth, plus lesser resources. Tuning is a powerful capacity management tool.
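To make the idea of seasonal ebb and flow concrete, here is a minimal sketch in Python of one way such a pattern might be quantified today. It is not the method we used (our programs were hand-written against SMF and RMF data); the function name `seasonal_indices` and the sample figures are hypothetical, chosen only for illustration. Each month's index is its average utilization divided by the overall average, so a value above 1.0 marks a busier-than-usual month.

```python
from collections import defaultdict
from statistics import mean


def seasonal_indices(samples):
    """Return {month: index}, where index = that month's mean / overall mean.

    samples is a list of (month, utilization) pairs, month in 1..12.
    This is an illustrative helper, not part of any IBM product.
    """
    by_month = defaultdict(list)
    for month, utilization in samples:
        by_month[month].append(utilization)
    overall = mean(u for _, u in samples)
    return {m: mean(vals) / overall for m, vals in sorted(by_month.items())}


# Hypothetical history: (month, average peak-hour CPU busy %) over two years.
history = [(1, 50), (6, 40), (12, 70), (1, 54), (6, 44), (12, 78)]
indices = seasonal_indices(history)

for month, index in indices.items():
    print(f"month {month:2d}: seasonal index {index:.2f}")
```

With these made-up numbers, December stands well above 1.0 and June well below it, which is the kind of year-end pattern the financial reports pointed us toward.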
Establishing the Platform
Having now collected or created the necessary components to manage capacity, it was time to take action:
- The first step was to establish the infrastructure: assigning responsibilities, securing management’s approval of the various positions, defining members’ roles, providing whatever training was possible, establishing regular meeting schedules, identifying contacts in departments and management, and defining implementation procedures for the functions described above.
- The next step was to establish a current performance baseline: defining system characteristics and tendencies, determining performance and capacity procedures and programs, running benchmarks to investigate performance characteristics under normal operations as then defined, deciding which reports to produce and when to run them, and other committee tasks.
- The third step was to initiate committee activities, tailor the activities and deliverables based on experience running them, solicit feedback and modify operations to meet the organization’s needs, and establish regular operations.
Go to Work
For the first year, the HAL committee (a limited, part-time assignment) spent much of its time building databases of information. SMF and RMF archives were created, and from those, historical patterns of resource consumption were developed via hand-written programs and segmented by line of business. Upper management, seeing an opportunity to manage expenses, embraced the concept and supported the effort. A tuning study was performed and a related methodology was implemented (the inspiration for the IBM Tuning Checklist I wrote and the original IBM CICS Performance Guide I co-authored); it quickly paid dividends.
It wasn’t long until the concepts HAL devised got put to work, because the mainframe’s contract was nearing completion. While the processor was still handling the load, overall resource consumption was nearing saturation levels, a number the client had never before been able to measure, and having that number allowed them to make an informed decision. Based on projected business volumes and transaction estimates, processor capacity would be completely exhausted within a few months. The client decided to upgrade to a newer, faster and more capable processor, a decision that proved correct and avoided the “dead stints” that had plagued previous years.
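The “exhausted within a few months” kind of estimate can be illustrated with a simple straight-line projection. The sketch below is an assumption about how one might do it today, not a record of our actual calculations: it fits a least-squares line to monthly peak utilization and reports how many months remain before the line crosses a chosen saturation threshold. The names `fit_line` and `months_until`, the 90 percent threshold, and the sample data are all hypothetical.

```python
def fit_line(ys):
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def months_until(ys, threshold):
    """Months from the last observation until the trend reaches threshold.

    Returns None when the trend is flat or falling (no projected crossing).
    """
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(ys) - 1))


# Hypothetical monthly peak CPU busy percentages, oldest first.
monthly_peak_busy = [60, 62, 64, 66, 68, 70]
months_left = months_until(monthly_peak_busy, 90)
print(f"Projected months until 90% busy: {months_left:.1f}")
```

A straight line ignores seasonality and planned business changes, so in practice a projection like this would be adjusted with the business growth estimates the committee gathered from upper management.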
Know Your System
An interesting side effect of carrying out the functions and analyses we’d built into HAL’s design was how much we learned about the IT system’s workings through the ongoing research and hands-on work. Our efforts resulted in a much deeper understanding of how IT components worked, interacted with each other and reacted to different conditions.
Experience is irreplaceable. Every system has its quirks, and knowing what they are is a precious gem of knowledge, often earned through long nights or weekends of debugging and experimenting, using trial-and-error techniques that work even though you don’t know why. The database of trends, tuning tips, stress tests, recovery procedures, data reorganizations, delete-and-restores, even simple things like powering down and up: these are tried-and-true solutions gleaned from experience.
A lot more can be done with today’s products, such as:
- Automated operations
- Advanced analysis tools
- New hardware and software capabilities
- Expansion into the networking aspect of IT
- Continuous capacity planning
- Workload scenarios and performance simulation
- Workload analytics
- Dynamic capacity optimization
- Peak period optimization
Capacity Planning: An Ongoing Payoff
Was the HAL capacity committee primitive and cumbersome? Most definitely, and today there are numerous products much richer in function with vastly more sophisticated features. Yet the basics endure: develop a historical database of performance and consumption, identify patterns and trends, interface with and educate IT and upper management, and establish relationships with hardware, software, and capacity analysis vendors for usage guidance, performance metrics, and tuning tips.
Regardless of processor size, associated peripherals, organization or applications, effective capacity planning and management is a must. For us, HAL members quickly evolved to add charts to tabular reports, run stress tests to gather information at different transaction rates, and build a compendium of test cases that reflected real-life scenarios. Service Level Agreements (SLAs) were implemented to provide hard data on performance levels we needed to achieve, and as we matched simulated performance against specific response targets, it became clear what the computing complex size and power needed to be. We had arrived.
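For readers who want to see what matching stress-test results against a response target can look like, here is a minimal Python sketch. It assumes stress tests were run at several transaction rates and that the SLA is expressed as a percentile response time; the function names, the nearest-rank percentile choice, and the sample measurements are hypothetical and are not drawn from the HAL committee's records.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def max_rate_within_sla(results, target_seconds, pct=95):
    """Highest tested rate whose pct-th percentile response meets the target.

    Rates are checked in ascending order and the search stops at the first
    failure, so a lucky pass at a higher rate is not counted.
    Returns None if even the lowest tested rate misses the target.
    """
    best = None
    for rate in sorted(results):
        if percentile(results[rate], pct) > target_seconds:
            break
        best = rate
    return best


# Hypothetical stress-test results: transactions/second -> response times (s).
results = {
    10: [0.4, 0.5, 0.6, 0.5],
    20: [0.8, 0.9, 1.1, 1.0],
    30: [1.5, 2.4, 3.0, 2.2],
}
supported = max_rate_within_sla(results, target_seconds=2.0)
print(f"Highest tested rate meeting the SLA: {supported} tx/s")
```

Comparing the supported rate with the projected business volume is one way to judge whether the current configuration is large enough or an upgrade belongs on the list.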