Mainframe Insights: Storage and Performance
Legend has it that when Michelangelo was asked how he managed to sculpt the statue David out of a block of marble, he responded that he chipped away at everything that did not look like David. Legend notwithstanding, this is a simplified description of Kyndryl’s approach to managing systems. Only instead of marble, we concern ourselves with data. The trick is in knowing what to chip.
Without data there are no systems. Everything is built on top of data; this is an obvious statement. What’s less obvious is the total cost of that data. Poor data management can lead to performance and capacity problems—and bury data in the fog of complex systems. There are business reasons why data can be toxic and carry risk. And systems can be surprisingly inefficient and costly. The good news is there are indicators that not only flag inefficiency, but also serve as proactive warning mechanisms.
We design our assessments to standardize the rollout of best practices. Our team works with each client to build a database of information from a long list of tools such as IDCAMS DCOLLECT, RMF, CMF, SMF, zBNA (Batch Network Analyzer), monitors and others, and then exposes that data to analytics in the form of programs and subject matter experts. We use a SWAT team approach, bringing in our best and most experienced people to analyze our clients’ data. The assessments are completed in 2–4 weeks so the client can begin harvesting potential gains as soon as possible. We recommend this approach because a short, focused engagement helps us maintain momentum and finish the job with less chance of interruption.
For the purposes of this discussion, I will divide the practices into two simple classes. The first class comprises the more obvious, standard best practices; the second examines the hidden problems running under the surface. Both are experiential, based on a summary of output from past studies and interviews with subject matter experts (SMEs). You can use these examples to guide your own data management practices.
The Enterprise Revolves Around Data
When you add a data set on z/OS, it must be cataloged. If the data set is called Mydata.b.c, then an alias called “Mydata” is added to the master catalog that points to a user catalog containing the entry Mydata.b.c. That entry points to the volume serial, and the volume’s table of contents (VTOC) lists all of the data sets on the volume along with their attributes.
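As a rough illustration, the lookup chain described above can be sketched in Python. This is a toy in-memory model, not how z/OS catalog services are implemented: the catalog name `UCAT.MYDATA`, the volume serial `VOL001` and the attribute values are all hypothetical, and the sketch assumes a single-level alias equal to the high-level qualifier.

```python
# Toy model of the catalog lookup chain. All names and values are hypothetical.
MASTER_CATALOG = {"MYDATA": "UCAT.MYDATA"}                  # alias -> user catalog
USER_CATALOGS = {"UCAT.MYDATA": {"MYDATA.B.C": "VOL001"}}   # data set -> volume serial
VTOCS = {                                                   # volume serial -> VTOC entries
    "VOL001": {"MYDATA.B.C": {"dsorg": "PS", "recfm": "FB", "lrecl": 80}},
}


def locate(dsname):
    """Follow alias -> user catalog -> volume serial -> VTOC entry."""
    name = dsname.upper()
    alias = name.split(".")[0]           # assumption: alias is the first qualifier
    user_catalog = MASTER_CATALOG[alias]
    volser = USER_CATALOGS[user_catalog][name]
    attributes = VTOCS[volser][name]
    return {"user_catalog": user_catalog, "volser": volser, "attributes": attributes}


print(locate("Mydata.b.c"))
```

Even in this simplified form, a single data set touches three separate structures, which is part of why removing an unneeded data set saves more than disk space.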
If there are Virtual Storage Access Method (VSAM) files, or Storage Management Subsystem (SMS) managed non-VSAM data sets on the volume, then a VSAM volume data set (VVDS) is required. The data set is assigned to a storage pool and a data class. It must be protected by a security solution like RACF or Top Secret.
The data will inevitably be backed up, in many cases with duplicate and often redundant copies. It will be copied to the disaster recovery site and then maybe to an air gap solution. Clearly, eliminating one unnecessary data set is worth more than the disk space it reclaims, because its catalog entries, security definitions, backups and replicated copies go with it. Therefore, our primary focus is the efficient storage of data, down to the byte level.
Getting Started: Configurations and Connections
We start our assessments from the top down, beginning with architectural documents showing installed devices, including their features (hardware configurations) and how they’re connected. We believe that each feature on the hardware must be tied to a business requirement and that there’s an optimal number of features depending on how each device is used (i.e., throughput best practices). Our best practices stem from advice from different vendors as well as the results of numerous studies we have conducted over the years.
For example, keeping the ports from the local synchronous mirror physically isolated from the ports for the asynchronous disaster recovery solution on the same disk subsystem can make a difference in performance. In other situations, we find that channels are frequently over-provisioned, driving unnecessary cost. This observation seems innocuous, but unnecessary ports lead to unnecessary cables and FICON directors. We’ve seen clients with 170 excess ports in a four-director, three-processor environment. In this situation, reducing the number of ports does not result in increased risk or I/O response time, but can result in substantial cost savings.
The best practices we developed are often conditional on the workload. For example, we may start with the premise that the disk subsystem cache requirement falls between 0.5 and 2 gigabytes per terabyte of usable storage. However, data from different tools like RMF, CMF, monitors and modelers may tell us that this particular client requires three gigabytes of cache per terabyte of usable storage.
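To make the cache rule of thumb concrete, here is a minimal arithmetic sketch. The 0.5 to 2 gigabytes per terabyte range and the three gigabytes per terabyte figure come from the text above; the 200-terabyte subsystem size is a made-up example, and the function names are our own.

```python
def cache_range_gb(usable_tb, low_gb_per_tb=0.5, high_gb_per_tb=2.0):
    """Starting-point cache range for a given amount of usable storage."""
    return usable_tb * low_gb_per_tb, usable_tb * high_gb_per_tb


def cache_required_gb(usable_tb, measured_gb_per_tb):
    """Cache requirement once workload data suggests a specific ratio."""
    return usable_tb * measured_gb_per_tb


usable_tb = 200  # hypothetical subsystem size
low, high = cache_range_gb(usable_tb)
print(f"Starting premise: {low:.0f} to {high:.0f} GB of cache")
print(f"Workload-driven:  {cache_required_gb(usable_tb, 3.0):.0f} GB of cache")
```

For this hypothetical subsystem, the workload-driven figure lands well above the top of the starting range, which is the kind of gap that only measurement data reveals.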
Performance Indicators
When we dive into performance, we start with the basics in BMC’s Comprehensive Management Facility (CMF) or IBM’s Resource Measurement Facility (RMF). This includes the IIPCP value (zIIP-eligible work that ran on general-purpose processors), which is used to determine if there is demand for more zIIP capacity. Other reports in these products can point to delays in the channel subsystem (pend time in the device activity report), or delays reading the data directly from disk (disconnect time in the device activity report and cache hit ratios).
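One simple way to read the device activity numbers is to break response time into its usual components (IOSQ, pend, disconnect and connect time) and see which one dominates. The sketch below assumes the four values have already been extracted from a report in milliseconds; the function name and the sample values are hypothetical, and it does not parse RMF or CMF output.

```python
def dominant_component(iosq_ms, pend_ms, disc_ms, conn_ms):
    """Return the largest response-time component, the total, and its share."""
    parts = {"IOSQ": iosq_ms, "PEND": pend_ms, "DISC": disc_ms, "CONN": conn_ms}
    total = sum(parts.values())
    name = max(parts, key=parts.get)
    share = parts[name] / total if total else 0.0
    return name, total, share


# Hypothetical device: pend time dominates, pointing at the channel subsystem.
name, total, share = dominant_component(0.1, 0.9, 0.2, 0.3)
print(f"{name} accounts for {share:.0%} of {total:.1f} ms response time")
```

A device dominated by pend time suggests looking at channels and directors, while one dominated by disconnect time suggests looking at cache hit ratios and the back-end disk.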
We also look at hidden indicators that are not easily accessible via performance monitors. This includes exploiting SMF 113 records to see if the system is being over-initiated or is at risk for thrashing. Sometimes reducing the number of initiators or segregating TEST from PRODUCTION processing can measurably improve throughput simply through more efficient use of memory.
Another example revolves around the concept of vertical polarization, or optimal use of vertical highs, vertical mediums, and vertical lows. A vertical high is defined as a physical engine dedicated to an LPAR. A vertical medium is a physical engine that can be shared between LPARs. A vertical low is a physical engine with no requirements that is parked until needed. These resources are assigned with more granularity as logical engines. If the ratio of logical engines to physical engines is too high, Processor Resource/Systems Manager (PR/SM) hypervisor overhead increases. This increase can be sizable. Our analysis of one particular bank revealed a significant PR/SM thrashing problem. Once the configuration was tuned, a processor upgrade was deferred for 14 months and our client avoided substantial costs.
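The logical-to-physical ratio mentioned above is straightforward to compute once the LPAR definitions are known. The following sketch uses invented LPAR names and engine counts, and the review threshold of 2.0 is only a placeholder for whatever value a site settles on; it is not a recommendation from this article.

```python
def logical_to_physical_ratio(logical_by_lpar, physical_engines):
    """Total logical engines defined across LPARs divided by physical engines."""
    if physical_engines <= 0:
        raise ValueError("physical_engines must be positive")
    return sum(logical_by_lpar.values()) / physical_engines


# Hypothetical configuration: 24 logical engines defined over 8 physical engines.
lpars = {"PROD1": 8, "PROD2": 6, "TEST": 6, "DEV": 4}
ratio = logical_to_physical_ratio(lpars, physical_engines=8)

REVIEW_THRESHOLD = 2.0  # placeholder value, to be chosen per site
print(f"Logical:physical ratio = {ratio:.1f}")
if ratio > REVIEW_THRESHOLD:
    print("Ratio is above the review threshold; check PR/SM overhead.")
```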
Storage Utilization
Our interest in storage capacity centers around true utilization. Clients are often billed for allocated capacity, but they can have 100 gigabytes allocated and only use one. From an allocation perspective they are at 100%, but true utilization is only 1%. This is an obvious tuning opportunity. The hidden opportunity might be in the form of poor blocking factors that increase I/O and waste capacity. Very small blocks waste storage because of additional overhead (inter-block gaps). And more blocks mean more I/O. That waste doesn’t show up in capacity reports.
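Both points can be shown with simple arithmetic. The sketch below computes true utilization for the 100-gigabyte example above, then counts how many blocks are needed to hold the same records at two block sizes. It assumes fixed-length blocked records, ignores track geometry and inter-block gaps, and uses an invented record count; the block size of 27,920 is just one commonly seen value for 80-byte records.

```python
import math


def true_utilization(used_gb, allocated_gb):
    """Fraction of allocated space that actually holds data."""
    return used_gb / allocated_gb


def blocks_needed(record_count, lrecl, blksize):
    """Blocks required for fixed-length records at a given block size."""
    records_per_block = blksize // lrecl
    if records_per_block < 1:
        raise ValueError("blksize must be at least lrecl")
    return math.ceil(record_count / records_per_block)


print(f"True utilization: {true_utilization(1, 100):.0%}")

records = 1_000_000  # hypothetical record count
unblocked = blocks_needed(records, lrecl=80, blksize=80)
blocked = blocks_needed(records, lrecl=80, blksize=27_920)
print(f"Blocks at BLKSIZE=80:    {unblocked:,}")
print(f"Blocks at BLKSIZE=27920: {blocked:,}")
```

Every block carries its own overhead on disk and has to be transferred, so the difference in block counts is a reasonable first indication of how much I/O and capacity a poor blocking factor costs.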
Available Storage Capacity
Another example of an obvious tuning variable is available capacity; the hidden problem is fragmentation. Modern IBM Z systems allow one-terabyte virtual volumes (subject to certain limitations), but many environments still have the smaller mod3, mod9, mod27, and mod54. A mod3 virtual volume is approximately three gigabytes of storage while a mod54 is 51 gigabytes.
Imagine moving the data from 20 mod54 volumes (51 gigabytes each) to a single one-terabyte volume (1024 gigabytes). The reduced fragmentation and overhead may make it possible to drive utilization from 70% to 85%. It also reduces work for the storage administrator in the form of fewer volumes, VTOCs, VVDSs, Global Mirror pairs, FlashCopy pairs and other definitions that are volume dependent. Larger pools of contiguous space help reduce space abends.
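The consolidation example can be worked through numerically. The volume count, volume sizes and utilization targets in this sketch are the ones quoted above; the function and key names are our own, and the 85% figure is a possible outcome rather than a guaranteed one.

```python
def consolidation_gain(volumes=20, volume_gb=51, target_gb=1024,
                       util_before=0.70, util_after=0.85):
    """Compare data held before and after consolidating onto one large volume."""
    capacity_before = volumes * volume_gb
    data_before = capacity_before * util_before
    data_after = target_gb * util_after
    return {
        "capacity_before_gb": capacity_before,
        "data_before_gb": data_before,
        "data_after_gb": data_after,
        "extra_data_gb": data_after - data_before,
        "volumes_removed": volumes - 1,
    }


result = consolidation_gain()
print(f"Capacity before: {result['capacity_before_gb']} GB across 20 volumes")
print(f"Data at 70%:     {result['data_before_gb']:.0f} GB")
print(f"Data at 85%:     {result['data_after_gb']:.0f} GB on one volume")
print(f"Volume-dependent definitions to retire: {result['volumes_removed']} sets")
```

Roughly the same raw capacity holds noticeably more data at the higher utilization, and each retired volume also removes a VTOC, a VVDS and any mirror or copy pairs defined against it.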
We believe solutions should be crafted to solve specific client pain points. We start with data because data lies at the core of everything. Protecting data and improving access can create dramatic improvements downstream. Like a sculptor, we search a system for the proverbial likeness of our data-processing David and carve out anything that does not belong.