The Art and Science of AIX Performance: The Stats Utilities
An article series dedicated to AIX and the methodology of attacking and resolving performance problems.
This is the third installment in my series on AIX performance. Part 1 focuses on current firmware, the foundation on which good AIX performance is built. I also explain why taking a detailed history is so important to diagnosing performance problems. Part 2 explores the depth to which one can understand how a system is put together: I examine the various configuration files generated by utilities like prtconf and PerfPMR, and I cite a few of the many logs that contain valuable information about adverse events in your systems.
Parts 1-2 provide everything needed to start a logical, fact-based investigation into any performance issue. Some final points of emphasis: Make sure PerfPMR, the IBM performance and diagnostic utility, is installed on all your systems. Likewise, make sure that, at minimum, these filesets are installed on every system: bos.acct, bos.perf.tools, bos.net.tcp.client and bos.sysmgt.trace. Then study the man pages and online documents for all of the tools these filesets contain; these pages also offer guidance on using PerfPMR.
When Trouble Finds You
Now what? First, some don’ts. Don’t tune systems that don’t need it. Don’t twist any vmo, schedo, ioo or no dials unless it’s absolutely necessary. Basically, don’t go looking for trouble. Rest assured, the trouble will find you, and most of the time you’ll be caught unaware.
At some point, probably off-shift or on a relaxing weekend, you’ll get paged about a performance problem. This is why I stay logged into my NIM server 24-7, both at work and from home: it lets me instantly connect to any system in my environment that’s experiencing an issue. I also keep HMC console windows open to my most critical systems. Together, these habits get me to my problem(s) as quickly as possible.
No matter the contents of an initial problem report, I start my problem analysis with vmstat. It provides a high-level view of a system’s CPU and memory performance, and it also gives you a tangential feel for how storage is behaving.
About 15 vmstat flags can be used in hundreds of combinations. For the quickest and best information, invoke it this way:
vmstat -w 2
This syntax starts vmstat and samples its statistics every 2 seconds, providing instant feedback. Once the short-term vmstats are started, open another terminal window to start long-term vmstats:
nohup vmstat -w -t 60 11000 > system_name_date_started &
This approach covers two bases: The short-term run highlights any immediate system problems, while the long-term run, recording its data over time, identifies issues that repeat or have lasting effects. Get in the habit of starting both forms (along with iostats, as we’ll soon see). Always start vmstat in wide format with the -w flag; this greatly increases readability and nicely lines up the column headings with the data (particularly if you use the Courier font at 10 or 12 points). The long-term invocation takes a sample of performance data every 60 seconds and runs for about a week, and its -t flag stamps each sample with the time it was taken. Write your data to a file with the greater-than (>) sign, labeling it with the start date and your system name.
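If you’d rather not type that label by hand, the shell can build the file name for you. This is just one way to do it; the naming convention is mine, not a requirement:

nohup vmstat -w -t 60 11000 > $(hostname)_$(date +%Y%m%d).vmstat &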
Now open yet another terminal window and start PerfPMR. If you’ve read my 4-part PerfPMR series, you know to start PerfPMR in a directory large enough to contain all the data PerfPMR will generate, and to let it run for the default of 10 minutes.
While PerfPMR data provides a deep dive into performance problems, that quick, high-level view is still needed. So keep monitoring your vmstat data. Rather than present an image of vmstat data, I encourage you to start vmstat on one of your own systems and keep it running while you read this article. I’ll walk you through what you’re seeing.
CPU and Memory: What to Look for
Since most AIX systems utilize shared CPUs these days, first check whether your CPU entitlement is being exceeded, and if so, by how much; the ec column shows the percentage of entitlement consumed. Next, eyeball the pc column, which reports physical processors consumed. If you elect to add more CPUs via dynamic LPAR (DLPAR) as a stopgap measure to alleviate a CPU shortage, here’s where you’ll learn how many CPUs are needed. Also look at the “r” and “b” columns at the extreme left of the output. These tell you the number of runnable threads in your system along with the number that are blocked from running, for whatever reason. As a rule of thumb, tally up all the logical processors (LPs) in your system and compare that number with the runnable threads figure. If the number of runnable threads is less than or equal to the number of LPs, you probably have enough CPU horsepower to accommodate your workload at that moment. However, if the number of runnable threads exceeds your number of LPs, that’s evidence of a CPU shortage that will soon cause your blocked thread count to spike.
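If you need that logical processor tally in a hurry, lparstat will give it to you while your vmstats keep running. The interval and count below are arbitrary; I just want a few samples:

lparstat 2 5

The System configuration header reports lcpu and ent (your entitlement), and the physc and %entc columns show physical processors consumed and the percentage of entitlement used, so you can cross-check what vmstat’s pc and ec columns are telling you.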
Now check your memory columns. Is your system paging? If so, by how much? How many memory pages are being paged in and paged out at any given time? Do your page-outs exceed your page-ins? Has your free memory list dropped to zero? These are indications of a memory shortage that can also be corrected via DLPAR, assuming you have the free memory available (or can activate more memory via Capacity on Demand). Also, when gauging memory over-commitment, see if your active virtual memory (the avm column, reported in 4K pages) is greater than your installed memory.
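A couple of quick commands help confirm what the vmstat memory columns suggest. Consider this a sketch of a cross-check, not a full memory workup:

svmon -G    # compare the virtual figure against the memory size figure; both are in 4K pages
lsps -s     # paging space utilization; a steadily climbing Percent Used confirms sustained paging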
Next is the context switching rate. Has it shot up since the performance issue was reported? Likewise, are your interrupt and system call rates now sky high? All of these counters point toward a resource shortage.
There are two schools of thought on how to proceed once you’ve diagnosed a CPU or memory shortage. The first, which has its merits, says you should simply add more CPUs or memory to the ailing system. In many cases, these actions will, indeed, return system operations to normal. Also, most of you have your systems set up for DLPAR where CPUs and memory can be added and subtracted on the fly. When every minute of system slowness can cost thousands, this course of action may be your only choice. However, it should be obvious that this approach tells you nothing about the cause of your shortage. Applying a bandage may mask a more pernicious, underlying problem. So I subscribe to another philosophy: Use vmstat to determine if there’s a resource shortage, but wait until you have more data so you can make an informed decision about root cause.
Once you’ve confirmed a CPU deficit, go to your PerfPMR data and extract a CPU Usage Reporting Tool (CURT) report from kernel traces. With a properly formatted CURT report―and by incorporating the flags I discuss here―you’ll be able to pinpoint the cause of your CPU problem within minutes, and with a high degree of confidence. Is a system call stuck in a loop and chewing up most of your CPU horsepower? Has an application gone to sleep and not resumed its work? Has interrupt or hypervisor activity gone crazy and monopolized your CPUs? Has a C subroutine gotten stuck? Rely on CURT for all this information. And keep in mind that a moment spent scanning a CURT report could save your company a lot of money in hardware costs. Should you find yourself in a situation where managers are breathing down your neck for results, let them yell if they must. You’ll know that taking the time to diagnose CPU issues with some degree of assurance is best for the business.
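If you’ve never extracted a CURT report, the general shape of the commands looks like this. The file names are the ones PerfPMR typically leaves behind, so adjust them to whatever is actually sitting in your PerfPMR directory:

# merge the per-CPU raw trace into a single file curt can read
trcrpt -C all -r trace.raw > trace.r
# build the report; gennames.out is the kernel names file PerfPMR usually captures
curt -i trace.r -n gennames.out -o curt.out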
What I just said about CPU shortages also applies to memory; if anything, it applies even more. Very heavy paging is likely to slow your system to a crawl. Any number of adverse memory conditions can ultimately cause a system to page or, in extreme cases, to thrash. A DLPAR operation can often provide a temporary fix for a memory shortage, but once again, vmstat will help you identify the problem, not its cause.
Now you need PerfPMR. In the directory where your PerfPMR data has been recorded, you’ll find a directory called mem_details_dir. It contains about a dozen reports that will greatly aid you in pinning down memory problems. One such report, memdetails.out, tells you exactly where memory is allocated in your system. Is the allocation mostly in client, persistent, text or shared libraries? Do those allocations look right based on what you know about your system’s behavior? Do they look right to your DBAs or application developers? How about memory leaks, one of the most common conditions that can deplete a system’s free memory? The mem_details_dir directory also includes several SVMON reports; use these as a baseline and add further SVMON studies to determine if processes are leaking memory.
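As a sketch of that baseline-and-compare approach, take periodic svmon snapshots of a suspect process and watch whether its working storage keeps growing. The process ID and interval here are placeholders:

# five snapshots of process 1234567, one minute apart
svmon -P 1234567 -i 60 5

If the working-storage page counts climb from snapshot to snapshot while the workload stays steady, you’ve probably found your leaker.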
My point is that since you made the effort to find your memory issues, take the time to pinpoint their cause. Simply adding memory won’t make leakers or other conditions that gobble up free memory go away. Most adverse memory conditions can only be fixed with some sort of code modification. With both CPUs and memory, get out of the habit of trying to fix problems by throwing more hardware at these resources. Let the system run poorly for a few minutes while you make a complete diagnosis. That’s how you truly “fix” these issues.
iostats and Storage Performance
Remember when I said that vmstat could tangentially point you to a storage performance problem? Let’s explore that. Check the wait (wa) column in your vmstat output. This tells you the percentage of time your CPUs sat idle waiting for I/O to complete. (Incidentally, this I/O can be to a storage device or an NFS filesystem. For now, we’ll only deal with I/O waits to storage.) Ideally, your wa column should always read 0, but since very few of our systems are tuned perfectly, wa counters in the low single digits are acceptable. Should I/O waits climb above 10 percent, that’s when I suspect a problem and start running input and output statistics, aka iostats, which can provide a complete picture of storage performance.
I use iostats similarly to the way I use vmstats. First, start a short form and let iostats run:
iostat -Dl 3
Then in another terminal, start a long form:
nohup iostat -Dl -T 300 600 > lpar_name_date_started &
In the long form, the -D flag adds extended drive statistics, -l lists each disk on a single line, and -T applies a timestamp to each sample. This invocation of iostat samples every 5 minutes for 600 intervals, or roughly two days of data. It provides a fairly complete picture of I/O activity: the rates of reads and writes, their minimum, average and maximum service times, and the time those I/Os spend waiting in queue.
If you’ve found nothing in your vmstats, and your CURT and PerfPMR data offer no evidence of any CPU or memory problem, storage iostats are the next place to look. Start with the serv qfull column on the extreme right-hand side of your output. A full service queue is the telltale sign of inadequately tuned storage, so look for high counters in this column. Any serv qfull number in the double digits is suspect, triple digits are a definite problem, and counters in the thousands call for extreme measures.
An overflowed service queue is the most common undiagnosed storage performance problem in any AIX environment. What could be the cause? Running the lsattr command should provide an answer. Pick an hdisk in your system and do an lsattr against it (where ## is your hdisk’s numerical identifier):
lpar # lsattr -El hdisk##
…lines omitted ….
queue_depth 3 Queue DEPTH True
This example is limited to a single line of the resulting output: queue_depth. This tunable tells you the size of the I/O queue to a particular device; in essence, it governs how many I/Os can be outstanding to that device at any given time. So the larger your queue_depth, the more I/Os the device can handle without those I/Os having to go into a holding pattern.
Most storage vendors, including IBM, set their initial queue_depth values very low, usually in the single digits. I’ve been told that the reason for low, shipped-from-the-factory queue_depth values is that vendors expect their customers to adjust them as needed for their environments. But I find this explanation suspect, because to adjust the queue_depth value, you have to know it’s there to begin with. And to determine that your default queue_depth is inadequate, you need to know where to look. Whatever the reason, default queue_depth values are almost never sufficient, particularly for environments with I/O-heavy databases.
In general, the higher your serv qfull values, the higher your queue_depth needs to be. At least half of all the storage performance issues I’ve seen in my nearly 20 years of experience have been fixed with a queue_depth adjustment. The caveat here is that any number of conditions can cause an overflow of the service queue in the first place, so, as always, forming a complete picture of the problem to make an effective diagnosis is essential. And remember that hdisks and queue_depth don’t exist in a vacuum. Other storage tunables―and other storage devices―must be taken into consideration when evaluating storage problems. But look to your serv qfull data first, and then twist the queue_depth dials as necessary.
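To see where every disk stands, and to apply a new value once you’ve settled on one, something along these lines works. The queue_depth of 32 is purely illustrative; take the actual number from your storage vendor’s guidance:

# survey queue_depth across all hdisks
for d in $(lsdev -Cc disk -F name); do
  echo "$d: $(lsattr -El $d -a queue_depth -F value)"
done

# raise queue_depth on one disk; -P records the change in the ODM so it
# takes effect at the next reboot (drop -P only if the disk is not in use)
chdev -l hdisk## -a queue_depth=32 -P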
One final storage-related tip: What about those storage adapters that are connected to your hdisks? Again, diagnosing one device and tuning it to the exclusion of other devices is a fool’s errand. Very few of you have LPARs that contain only the internal disks that shipped with your IBM Power Systems hardware. Typically, you’ll have some sort of storage array attached to those boxes. And how are they attached? With either a physical or virtualized storage adapter. Much of the time, those adapters will be fiber channel.
Say you’ve found a problem with queue_depth and tuned that value on the affected hdisks. But did you know that fiber channel adapters have a similar parameter? It’s called num_cmd_elems. So run an lsattr -El fcs# (# is the numerical designator for one of your fiber channel adapters). See the num_cmd_elems value? This is, essentially, queue_depth for adapters. Because this value is also set very low at the factory, num_cmd_elems also almost always needs to be adjusted. The default value is usually 200, but most database vendors recommend a starting value of 2048.
To determine if your fiber channel adapters need num_cmd_elems tuning, use fcstat. While running an fcstat against any of your fiber channels yields quite a bit of output, we’re only interested in a few lines:
fcstat fcs0
…lines omitted ….
FC SCSI Adapter Driver Information
No DMA Resource Count: 128
No Adapter Elements Count: 512
No Command Resource Count: 1024
Taken together, the no adapter elements and no command resource counts tell you if the adapter’s service queue is being overflowed and by how much. (The no DMA resource is tuned separately.) Tune num_cmd_elems so that any positive values in no adapter elements and no command resources drop to 0, and don’t be surprised if your num_cmd_elems values eventually go to 4096, 8192, or even higher.
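Checking and raising num_cmd_elems follows the same pattern as queue_depth. The 2048 below is just the starting point mentioned earlier, not a universal answer, and as with queue_depth the adapter must be free of activity, or the change deferred with -P until the next reboot:

lsattr -El fcs0 -a num_cmd_elems -F value     # current setting
chdev -l fcs0 -a num_cmd_elems=2048 -P        # takes effect at the next reboot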
Dealing with so many disparate devices―physical and virtualized, IBM and non-IBM―makes storage arguably the most complex area of performance tuning. Storage tuning only starts with queue_depth and num_cmd_elems; many other tunables must be considered. I plan to write about this in depth in the near future, but for now, let’s cover one more stats utility: network statistics―aka, netstat.
Our final indispensable stats program, netstat, can tell you at a glance if anything is amiss with your networking subsystem. Like vmstat and iostat, netstat has many flags, each of which has its place in diagnosing network performance problems. For now, we’ll focus on the go-to flag to use when you’re pressed for time: the -v flag. While the “v” doesn’t stand for versatile, it could. That’s because a netstat -v, in addition to providing networking information, provides statistics on fiber channel adapters that are nearly identical to the numbers contained in an fcstat. Keep that in mind when you’re evaluating both resources.
With networking issues, look initially for errors of transmission or reception that may impact system health or performance. Much of the time, TCP/IP errors manifest themselves on the receive side, so check the data on the right-hand side of your output. Under the receive statistics column, note any receive errors as well as dropped packets. Then scan lower in the output for any DMA overruns or no resource errors. Disregard alignment or collision errors for now; these happen rarely, and besides, their occurrence generally indicates a problem you won’t be able to fix on your own. No resource errors, on the other hand, may require only additional network buffers to correct any number of problems that can occur with your TCP/IP stack.
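When you’re pressed for time, a quick filter pulls the troublesome counters out of the wall of netstat -v output. The pattern below is just my shorthand; broaden or narrow it to taste:

netstat -v | egrep -i "error|overrun|resource|dropped"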
Think of netstat -v as a starting point. It directs you to the areas that require further diagnosis. It also helps you determine which team members (e.g., networking hardware or DBAs) you’ll need to involve to complete a solution.
The Forest View
In this article, I’ve shown you the tools that provide the most performance information in the least amount of time. These stat tools have helped me tremendously in my own performance practice over the years. They never fail to give me the high-level, “forest” type of view that’s so essential for gaining a quick understanding of performance problems. As you gain experience, your diagnostic methods will likely deviate from mine―maybe a little, maybe a lot. And that’s okay, because that’s the art of performance: understanding your environment and the tools you need in any situation.
So that’s the forest view. In the next installment, we’ll go among the trees. I’ll share some of the tools that help me form a complete performance diagnosis.