The Art and Science of AIX Performance: The System Monitors
The first three installments in this series of articles cover much ground. We’ve learned how to deploy statistics-gathering programs to collect data on the behavior of system workloads, we understand that maintaining a complete history of those systems and what runs on them is essential to our diagnosis, and we know that keeping firmware up to date can prevent many problems from occurring in the first place, not just in performance but also in areas like stability and security.
Now let’s move on to the next toolset in our diagnostic arsenal. Part 3 covered the stats utilities―vmstat, iostat and netstat―that show how your environment is operating and tell you when a problem is occurring. Now we’ll look at the tools that can help you pin down where that problem is occurring: the system monitors.
svmon
Let’s start at the top. I call svmon the king of the monitors, because after 20-plus years of using it, I haven’t come close to exhausting svmon’s possibilities. It includes dozens of flags and arguments to those flags, any of which can be used in a nearly limitless number of combinations.
As your system’s virtual memory monitor, svmon captures snapshots of the current state of memory and presents the data for analysis. Each of its report types serves a different purpose, depending on the situation. You’ll use some more than others, and a few you may not use at all. They include the command report, the detailed segment report, the global report, the process report, the segment report and two types of workload management reports. There’s also the option to present svmon data in XML format (as opposed to the default of ASCII text).
How you invoke svmon is dependent on the situation. Say you’ve determined that your system is running low on memory. Over time, pageouts are increasing while free memory is decreasing. Now that you know what’s happening, you need to figure out why it’s happening. Whether it’s memory leaks, pool constraint or some other abnormal condition, svmon almost certainly can be invoked in a form that will diagnose the problem. But let’s start with the basics.
As root, enter “svmon” at a command prompt. In this most basic form, svmon shows you the amount of memory on your system and breaks that allocation down by page size. The default svmon behavior is to display memory in terms of 4K pages. To display in megabytes or gigabytes, add the “-O unit” option (which also enhances the display’s readability).
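A minimal sketch of both forms; the “-O unit=MB” syntax is as documented for recent AIX releases, so check your release’s man page:

```shell
# Global memory snapshot in 4K pages (the default behavior)
svmon -G

# The same global report in megabytes, which is far easier to read
# (use unit=GB on large systems)
svmon -G -O unit=MB
```

Running svmon with no flags at all produces the same global report as “svmon -G”.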
Many times, you’ll need to know how much memory each process in your system is consuming. In these cases, the ps command will take you only so far. To locate the processes that are consuming memory, determine how many pages of each size (small, medium, large and huge) are in use, and see how much of that small- and medium-page memory is being paged out, you need svmon. While we’re on the subject of per-process memory, svmon makes diagnosing memory leaks easy: zero in on your suspect process and invoke svmon with a sample rate and duration, then watch to see whether its working segments consume additional memory without ever returning it to the system.
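A sketch of the per-process view, assuming recent-AIX svmon flags:

```shell
# The ten largest memory consumers, one summary section per process
# (-P = process report, -t 10 = top ten by real memory, sizes in MB)
svmon -P -t 10 -O unit=MB
```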
Along with being the go-to method for detecting memory leaks, svmon can be run over time, which makes it very useful in detecting peaks and troughs of memory usage. Here’s a simple form of this svmon invocation:
svmon -i 3 3
This says to run svmon with an interval rate of 3 seconds for a total of three data samples. Here’s where you can get creative. By adding various combinations of flags and arguments in your command strings, this interactive form of svmon provides an in-depth look at your system’s memory usage.
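For instance, the interval form can be married to the process report for leak hunting; the PID below is a placeholder, and the flag combinations are a sketch to verify against your release’s man page:

```shell
# Every 60 seconds, 30 times: watch one suspect process.
# Working segments that grow and never shrink suggest a leak.
# 1234567 is a placeholder PID; substitute your own.
svmon -P 1234567 -i 60 30

# Or watch the five biggest consumers, in MB, every 3 seconds
svmon -P -t 5 -O unit=MB -i 3 3
```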
With many hundreds of flag/argument combinations available to you, my best advice is to print out the man page and commit it to memory; beyond that, I’ll only give you some guidelines. How you choose to invoke svmon (and filemon and netpmon) may be totally different from how I do it, and that’s fine. Dealing with your system’s performance problems starts with doing things your way.
filemon
A trace-based utility, filemon invokes the trace program to capture its data and present it for analysis. I’ve always used filemon as an intermediate step in performance diagnosis, because it presents the detail on how your storage environment operates that iostat only hints at. While it lacks the versatility of svmon, you can still do a lot with filemon. It includes about a dozen flags that can be joined with many arguments.
Used alone or in combination, filemon allows you to zero in on the performance of hdisks, logical volumes, filesystems and files. It can identify storage structures that are read- or write-heavy and provide hotspots for each. It can also tell you which structures are the most active overall and which processes are accessing which files. And that’s just for starters.
When I run my first filemon report on any system, I always use a vanilla invocation. This provides the broadest picture of storage performance. Armed with this information, I’ll then tailor subsequent filemon reports according to what I see in the initial run. Here’s my initial filemon command:
filemon -v -o fmon.out -O all ; sleep 10 ; trcstop
This form of filemon runs for 10 seconds, invoking a look at every storage structure in your system. It writes output to the fmon.out file, then self-terminates. Always include a sleep statement in your command string to tell filemon how long you want it to run, and be doubly (no, triply) sure you finish with a trcstop. Remember how I said filemon invokes the AIX trace facility? The trcstop command gracefully shuts down trace. Without trcstop, trace will continue to run and eventually impose a severe performance penalty on your system, so never forget to terminate trace.
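One way to make that trcstop hard to forget is to wrap the sequence in a small script with an exit trap. This is a sketch; the output path and 10-second window are illustrative choices:

```shell
#!/bin/sh
# Run filemon for a fixed window and guarantee trace shutdown.
# The trap fires on normal exit (0) and on SIGINT/SIGTERM (2, 15),
# so trcstop runs even if the script is interrupted mid-capture.
trap 'trcstop' 0 2 15
filemon -v -o /tmp/fmon.out -O all
sleep 10
# trcstop runs automatically via the exit trap
```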
filemon can be used interactively, as with the above command string, but that’s not the only way: it can also extract storage performance data from an AIX kernel trace. I find this second use extremely helpful once a performance issue has manifested itself. Say you have kernel trace files, gathered separately or through PerfPMR, from the exact time of your problem. You need filemon data from that exact time, and running filemon interactively after the fact would be useless; extracting its data from those traces instead lets you run several different filemon reports to pinpoint root cause. Here’s how to extract filemon data from a trace file (assuming the default trace file name of “trace.out”; use your own file name):
filemon -i trace.out -n gensyms.out -o filemon.data -O all
This form of filemon says to extract your data from the trace.out file, gather kernel extension, shared library and process information via the gensyms.out file, and write that output to the filemon.data file. Finally, we capture statistics on all storage structures with the -O all argument; don’t omit anything at this point.
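If you need to produce those two input files yourself rather than taking them from PerfPMR, the sequence looks roughly like this; the command names are standard AIX, but the trace options shown are illustrative, so check your release’s trace documentation:

```shell
# Capture a raw kernel trace and the matching symbol snapshot
trace -a -o trace.out    # -a: run the trace asynchronously
sleep 10                 # let it run through the problem window
trcstop                  # stop the trace (never skip this)
gensyms > gensyms.out    # record symbol/process info for the -n flag
```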
Again, you can do a lot with filemon, and we’ve barely scratched the surface as far as this utility’s capabilities to diagnose storage problems. I encourage you to practice both the interactive and trace-extraction forms of filemon with many different combinations of flags and arguments. For a refresher on generating trace files, see my articles on tracing and PerfPMR (here and here). I also plan to cover AIX kernel tracing in a future article series.
netpmon
Our final system monitor is a pauper compared to “King” svmon; it stands as one of those forgotten or never-learned programs. Nonetheless, netpmon, the monitor that helps you diagnose network performance issues, can be a lifesaver.
I’m not sure why so many administrators overlook netpmon. Seriously, I’m the only performance practitioner I know who uses it. netpmon is to the networking realm what filemon is to storage: it invokes the trace utility as part of its function to report on network I/O and on how the networking subsystem uses CPU resources. As such, it can provide a finely detailed view of network operations that higher-level utilities like netstat simply can’t.
Since netpmon uses the trace facility, it must also be invoked with a sleep statement and ended with a graceful termination of the underlying trace. As I do with filemon, I always start off using netpmon in a vanilla manner. Once I’ve seen what there is to see in my initial report, I’ll invoke netpmon in any number of different flag/argument combinations for further network studies. Here’s my initial invocation:
netpmon -v -o netmonitor.out ; sleep 10 ; trcstop
In this form, your netpmon report starts by listing your system’s most CPU-intensive processes. You’ll learn not just how much total CPU each process consumes, but how much of that usage exercised your networking subsystem. From there, netpmon reports on interrupt handlers, followed by socket calls (first in summary form, then broken out by protocol). As with svmon and filemon, netpmon’s flag/argument combinations are many, so again, print out the man page and get practicing. And like filemon, netpmon can be used either interactively or in offline mode to analyze previously recorded trace files. Here’s how to use netpmon to extract network data from an existing trace file:
netpmon -i trace.out -n gensyms.out -o netpmon.trace.extract
Look familiar? This syntax is identical to that used when extracting filemon data from a trace.
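As a taste of those flag/argument combinations, an interactive follow-up run can be narrowed to just the report groups you care about; the “-O” values here are from the netpmon man page, with the file name and duration as illustrative choices:

```shell
# Second-pass study: only CPU usage and socket-level I/O
# (-O cpu,so; other groups include dd for device drivers and nfs)
netpmon -O cpu,so -o netmon2.out ; sleep 10 ; trcstop
```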
The best thing about netpmon is that it points up many network difficulties that other utilities miss. That’s why I use it.
Leave Nothing to Chance
So there you have them: the monitor utilities are your intermediate step in the diagnosis of systemic performance issues. Get in the habit of using them often and in many different forms. In this way, you’ll leave nothing to chance in your diagnostic efforts.
I’ll conclude this article series with a look at what I consider two of the most essential skills in systems administration: intuition and instinct. As with any technical skill, intuition and instinct are honed only through years of effort. By putting in this work, you can be rewarded with remarkable insights into whatever environment you manage. Chances are you’ve been in a situation where everyone’s telling you how to deal with a performance issue, but your inner self is telling you something completely different. In my final installment, I’ll explain why you should always go with your gut.