Trapping the Elusive Error with AIX Kernel Tracing
Every computer administrator must deal with error conditions, because every operating system, database and application experiences its own particular annoyances. Learning how to combat these conditions requires a combination of experience and instinct.
Error conditions can manifest on a regular, recurring schedule, or at totally random times. They may occur in a few microseconds, or over the span of several full seconds. Many are harmless, but some are capable of bringing down an entire production environment. Within the context of performance, however, error conditions are rarely benign; there are only degrees of the adverse effects they can impose on a system.
Spotting these errors requires, foremost, time spent intensively studying your own AIX environment. Numerous AIX system logs can help you identify error states. Of course there’s the error report (errpt), but while this log presents a lot of information, it’s not always sufficient. In general, an I/O error statement in the errpt, or a simple notification of the failure of this or that component (which the errpt may not even specifically identify), gives you little to go on. The sense data in each errpt entry does identify particular components, but few of us have access to the codes that decrypt those hieroglyphs.
So sometimes you’ll have an error that’s obliquely identified, and you’ll need to pin down the actual problem and the effect it’s having on your system. In most cases though, error conditions won’t be flagged in any AIX system logs or detected by standard admin tools. Like a lot of UNIX demons, they’ll pop up occasionally, do their dirty work, and then retreat into the background. They’re basically invisible―until you learn how to hunt them down and trap them.
Get to Know the errno.h File
I’ve written several articles on using AIX kernel tracing, both as a standalone diagnostic tool and as part of the PerfPMR diagnostic suite. Re-familiarizing yourself with that material will serve as a refresher on how to use AIX kernel traces to capture diagnostic data, but there’s also the matter of setting up tracing, which I’ll review in a moment.
But let’s start with something relatively new. Since its early days, a standard list of error states has been built into AIX. This is known as the errno.h file. To get at its contents, follow these steps on a non-production system:
- Log in as any user; root privilege isn’t needed.
- Change your working directory to /usr/include/sys.
- Page through errno.h and then print out a few copies to stash in your operational run book as well as your big binders containing the configuration listings of all your AIX systems. (You do have the big binders, yes?) The command-line sketch after this list covers these steps.
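If you prefer to work entirely from the command line, those steps boil down to something like this (pg is the stock AIX pager; substitute more or your favorite editor, and point lp at whatever print queue you actually use):

cd /usr/include/sys
pg errno.h      # page through the error definitions
lp errno.h      # send a copy to your default print queue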
Now take a few minutes to familiarize yourself with the format and content of the errno.h file. What did you glean from your reading? Probably the first thing you noticed is the acronyms that appear in nearly every one of these 200-some lines of text: e.g., EPERM, EIO and EBADF. These definitions represent error states: the most common error conditions that occur in, and are trapped by, various AIX code routines, the trace facility among them.
To the right of each definition is a terse description of what the error signifies, enclosed in a C programming language comment ( /* */ ). Each definition is, by itself, a category of sorts. This stand-alone data wouldn’t be all that helpful, but when examined in the context of an AIX kernel trace, these labels tell you a great deal. Each label indicates the existence of an error and includes its general type. The trace data that surrounds the error completes the picture, allowing you to understand the condition and develop a methodology to get rid of it.
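To pull up an individual entry, grep for its symbol. The exact number and comment wording can vary a bit from one AIX level to another, but the line you get back will look roughly like this:

grep -w EACCES /usr/include/sys/errno.h
#define EACCES          13      /* Permission denied */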
For now, I’ll show you how to spot the errors that errno.h lists. I want you to be able to run a trace and search for these particular errors so you’ll get a feel for how often they occur in your system(s). In future articles, I’ll show you how to interpret the context in trace data so you can pin down the exact cause of a given error.
The Search for Trouble
So let’s look at some examples of what I’m talking about. Of course we first need to generate trace data so we have some errors to examine. You can use any syntax to generate a trace file; just don’t omit any hooks or events. I typically start with this:
trace -a -fn -o trace.out -T40M -L400M ; sleep 5 ; trcstop
This syntax says to run a trace asynchronously in what’s called single mode, add some information to the trace header, use a trace buffer size of 40 MB (vs. the default of 128 KB), and write the trace data to a log file of 400 MB (vs. the default of 1 MB). The trace runs for 5 seconds, as defined by our sleep statement, and terminates with the trcstop command. This form of the trace command will capture data from every CPU in your system to the trace.out file. Granted, writing a trace file for every CPU individually (with the -C all flag) does provide cleaner data, and all those individual files are a bit easier to read. Keep in mind, though, that all we’re after here is bulk error data.

Once the trace data is captured, we need to convert it from trace format to ASCII. As I’ve noted in previous articles, we use the trace report (trcrpt) command for this:
trcrpt -Oexec=on,pid=on,tid=on,cpuid=on,svc=on trace.out > trace.formatted
This version of trcrpt says to format our trace.out file using the full executable path names for every program running during the trace and display process and thread IDs. It also shows us the CPU number that those executables are running on and any service handlers (generally C language routines) that are invoked on behalf of the process. We then write our formatted data to a file called trace.formatted.
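If you end up collecting traces regularly, it’s convenient to wrap both steps in a small ksh script. The sketch below simply strings together the same trace and trcrpt commands shown above; the file names and sizes are only examples, so adjust them for your environment:

#!/bin/ksh
# Collect a short kernel trace and format it in one pass.
# The commands, sizes and file names mirror the examples in this article;
# change them to suit your system.
trace -a -fn -o trace.out -T40M -L400M ; sleep 5 ; trcstop
trcrpt -Oexec=on,pid=on,tid=on,cpuid=on,svc=on trace.out > trace.formatted
echo "Formatted trace written to trace.formatted"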
Now we’re ready to look for trouble. The easiest search method is to simply page through the trace.formatted file and search on the word “error.” You’ll likely spot some (hopefully non-serious) errors that aren’t captured in your system logs. Again, these guys are pernicious and generally silent―that is, until you break out the heavy-duty tools to track them down.
I’ve written a script that searches for each error description in the errno.h file. It’s a little tedious to set up, but once it’s actually running, it’s a great time-saver.
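My script is tailored to my own environments, but a bare-bones version of the idea looks something like this. It pulls every E symbol out of errno.h, then counts how often each one appears in the formatted trace (trcrpt reports them as "error EACCES", "error EAGAIN" and so on, as you’ll see in the examples below):

#!/bin/ksh
# Count how often each errno.h error symbol shows up in a formatted trace.
# Assumes the "error SYMBOL" wording that trcrpt uses in its output.
TRACE=${1:-trace.formatted}

awk '/^#define[ \t]+E/ { print $2 }' /usr/include/sys/errno.h | sort -u |
while read err
do
    count=$(grep -c "error $err" "$TRACE")
    [ "$count" -gt 0 ] && printf "%-12s %6d\n" "$err" "$count"
done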
The following output is the result of a trace I ran on a system that was experiencing performance difficulties. Nothing was indicated in the logs. The first occurrence of an E error appears in the last of the three trace records below. Focus on the EACCES acronym, which our printed errno.h file describes as a permission denied error. This usually means that whatever program was running on the line that generated EACCES tried to access some data (or memory or a routine) for which it had insufficient privilege. The trace captured the access violation and flagged it.
101 bash 0 59900690 75236001 -032A55B0- 0.008397031 0.000637 032A55B0 LR = 90000000003F938
16E bash 0 59900690 75236001 -032A55B0- 0.008397675 0.000644 setpgid pid=prKernelPID(3DF034A) pgrp=64947018
104 bash 0 59900690 75236001 -032A55B0- 0.008398367 0.000692 return from 032A55B0. error EACCES [1 usec]
The full explanation is that bash (one of many UNIX shells) was making the access. The data surrounding the line containing EACCES indicates that bash issued a system call that attempted to set its privilege for something (in this case, a storage replication script running in bash) to which it had no rights. The access violation was flagged and the attempt was terminated. Fixing this was simply a matter of adding a general security enhancement to the script so that, when it’s run with bash, sufficient authority exists to set the privilege needed to access certain groups. That was easy.
Another example is below. This is one of my favorites. I know it sounds funny to have an affinity for an error condition, but EAGAIN has helped me on countless occasions. EAGAIN flags errors that are caused when a resource―usually a device of some sort―is unavailable to the program attempting the access. It’s raised when some sort of non-blocking I/O is happening in a system. It means: “No data is available now. Try again later.”
104 backup_prog 0 17302152 16777481 -032A3150- 0.038256332 0.000178 return from 032A3150. error EAGAIN [5 usec]
I see this all the time with my clients. For whatever reason, backup times (usually for databases, but it can be anything) start the agonizing crawl upward from, say, 30 minutes to an hour to multiple hours. There’s nothing definitive in the backup program’s logs, or in the database and AIX logs. At this point, running a kernel trace and searching on the E errors is your best course of action.
In one particular case, a volume group import was aborted due to an issue with the backup software. The net result was a non-stop search for non-existent data to back up. This consumed so much time that the entire backup schedule was missed. Bad as that was, the fix was pretty simple. We modified the backup schedule to allow time for the import to complete; the backup program then found the data it was seeking, and backups proceeded normally.
Keep in mind, this was only revealed by EAGAIN. There was nothing in the system logs or anywhere else, but once we noticed the stream of EAGAINs in the trace file―and examined the surrounding contextual data―we were pretty sure we had a correct diagnosis.
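Pulling that surrounding context doesn’t require anything fancy. A short awk sketch like the one below prints each EAGAIN hit along with the two trace records that precede it; the window size and the search string are just examples to adjust:

awk '{ buf[NR % 5] = $0 }
     /error EAGAIN/ { for (i = NR - 2; i <= NR; i++) if (i > 0) print buf[i % 5]; print "----" }' trace.formatted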
Here’s one more example: I was running some kernel traces on a system, thinking I had a line on the problem. I ran my script and found a bunch of EINTR occurrences in relation to the xntpd daemon. EINTR is a POSIX error raised when a blocking call (a receive, a send, a semaphore wait and the like) is interrupted before it completes. Of course, xntpd is how the network time protocol is implemented in AIX.
104 xntpd 0 7274986 16122361 0.768281363 0.000729 return from system call. error EINTR
To this point, none of us had noticed anything wrong with xntpd, so we were surprised when these errors appeared. Normally, EINTR is raised when one of those blocking calls is interrupted; what’s supposed to happen is that the code that encounters EINTR simply retries whatever action it’s attempting. On the system I was monitoring, we had 27 EINTRs in a 5-second trace. Not an outlandish stat, but a bit puzzling. The upshot is that the network adapter over which xntpd was doing its thing was throwing totally bogus errors. The adapter’s driver code had recently been modified, but because an old, now-obsolete error-handling routine hadn’t been removed, the programs that used the adapter started spewing EINTRs in response to a condition that didn’t really exist. Once we updated the driver, these particular errors went away. In this case, EINTR pointed us to an issue we weren’t even looking for. That’s how valuable these E error descriptions can be.
The Hunt Begins
It’s pretty simple, really: Print out the errno.h file, and then by whatever means is most convenient, search for every error that’s listed in the file. That’s how you hunt down the elusive E error.
And remember: E errors are very often undetectable until you know how to spot them. They’re often benign and won’t cause your system any obvious performance difficulties, but they can occasionally be serious. In any case, I’m willing to bet that once you start your own search for E errors, you’ll be surprised by what you find.