Analyzing System Dumps
A system dump indicates a severe problem with an AIX system. System dumps usually halt the system, necessitating a reboot.
I previously showed you how to read the contents of a core dump to determine the cause of a fault in an executable. An executable is any program (an application, database, middleware or utility, for example) that runs on a UNIX system. A core dump simply indicates that either the environment an executable runs in or a piece of code within that executable has faulted. In these instances, a snapshot of the memory area servicing the executable is written to a file that can be examined using several tools.
In this article, we’ll look at another type of dump: the system dump. In contrast to the largely benign core dumps, a system dump indicates a severe problem with an AIX system. While a core dump can be created and written to a file while an AIX system is up and running, system dumps usually halt the system altogether, necessitating a reboot. When a system dump occurs, AIX will attempt to capture the entire contents of memory and write that data to the dump device. As such, it’s imperative to have a dump device defined with enough space to contain the dump. (It’s possible to gain some insight by diagnosing a partial dump, but you may miss critical data that could completely change the results of your dump analysis.)
I’ll assume your system is configured with adequate dump devices, and further, that it’s set up to capture a complete dump. With that in mind, let’s begin.
The Savecore Command
As with most system diagnoses, a little prep work is required. Let’s presume that your system dumped and is back up and running. (Of course, there are many reasons why it may not be back up. However, those issues are beyond the scope of this article.) Log into your system and run the error report (the errpt command); one of the entries will look like this:
67145A39 0413095315 U S SYSDUMP SYSTEM DUMP
Here, you see the system dump flagged; expanding this entry will tell you the time and date of the dump, among other things. To begin your diagnosis, copy the dump from the dump device to a file using the “savecore” command:
savecore .
Yes, the period is necessary. It indicates you want the dump copied to your current directory. As a general rule, make sure the free space in whatever file system you copy the dump to is roughly double the size of your system’s memory. You’ll need the space for both the original compressed dump and the uncompressed dump. Using this form of the savecore command, you’ll get not only the dump itself, but a copy of the UNIX kernel that was running at the time of the dump. Because you have the kernel file, if you find you don’t have enough space on your box for the uncompressed dump, you can FTP the dump and the kernel file to another system and do your analysis there.
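The space rule above is easy to check before you run savecore. What follows is a minimal sketch, not a definitive tool: dump_space_ok and check_dump_space are hypothetical helper names, and the wrapper assumes that lsattr reports realmem in KB and that Free is the third column of df -k output on AIX.

```shell
# Succeed (exit 0) when free_kb is at least double mem_kb.
dump_space_ok() {
    mem_kb=$1
    free_kb=$2
    [ "$free_kb" -ge $((mem_kb * 2)) ]
}

# AIX-specific wrapper. Assumptions: realmem is reported in KB,
# and "Free" is the third column of `df -k` on AIX.
check_dump_space() {
    dir=${1:-.}
    mem_kb=$(lsattr -El sys0 -a realmem -F value)
    free_kb=$(df -k "$dir" | awk 'NR == 2 { print $3 }')
    if dump_space_ok "$mem_kb" "$free_kb"; then
        echo "OK: ${free_kb} KB free, need $((mem_kb * 2)) KB"
    else
        echo "Too small: ${free_kb} KB free, need $((mem_kb * 2)) KB"
        return 1
    fi
}
```

Run check_dump_space from the directory you intend to copy the dump into, or pass that directory as the first argument.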
savecore will copy the dump to your current directory, and name it:
vmcore.0.BZ
BZ stands for bzip, the dump utilities’ preferred compression method.
Next, uncompress the dump using the dmpuncompress command:
dmpuncompress vmcore.0.BZ
dmpuncompress will run the bzip utility and do the decompression. You’re now left with a file called “vmcore.0”. Dumps, as well as the kernel files, are suffixed with a number indicating the order in which they were saved; if savecore finds a dump already saved with the .0 suffix, it will label successive dumps .1, .2, .3 and so on.
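When several dumps have accumulated in one directory, it helps to find the most recent sequence number before going further. Here is a small sketch under stated assumptions: latest_dump_seq is a hypothetical helper, and it assumes dumps are named vmcore.N or vmcore.N.BZ as described above.

```shell
# Print the highest N for which <dir>/vmcore.N or <dir>/vmcore.N.BZ exists.
# Prints nothing when the directory holds no saved dumps.
latest_dump_seq() {
    dir=${1:-.}
    for f in "$dir"/vmcore.*; do
        [ -e "$f" ] || continue       # unmatched glob stays literal; skip it
        n=${f##*/vmcore.}             # strip directory and "vmcore." prefix
        n=${n%.BZ}                    # drop the compression suffix if present
        case $n in
            ''|*[!0-9]*) continue ;;  # ignore anything that is not a number
        esac
        echo "$n"
    done | sort -n | tail -n 1
}
```

For example, n=$(latest_dump_seq .) gives you the suffix shared by the newest dump and its matching kernel file.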
Lastly, format the dump:
/usr/lib/ras/dmprtns/dmpfmt -c vmcore.0
You should get a message that says the dump was completed properly and can be read:
“This dump appears complete - The end-of-dump component was found.”
Reading a Dump
Just as core dumps are read with the dbx utility, system dumps have their own tool: the kernel debugger (kdb). The kdb has many uses beyond reading dumps. I’ll cover some of them in future articles, but let’s focus on the task at hand. Read the contents of your dump and its corresponding kernel file into the kdb with this command:
kdb vmcore.0 vmunix.0
You’ll now enter into the kdb context, with a screen that looks like this:
[lpar_name:/datafs] # kdb vmcore.0 vmunix.0
./vmcore.0 mapped from @ a00000000000000 to @ a000000393998ee
START END <name>
0000000000001000 00000000058A0000 start+000FD8
… lines omitted …..
Dump analysis on CHRP_SMP_PCI POWER_PC POWER_6 machine with
2 available CPU(s) (64-bit registers)
Processing symbol table...
.......................done
read vscsi_scsi_ptrs OK, ptr = 0xF1000000C01E1380
And you’ll now be sitting at a prompt that looks like this:
(0)>
This is your current context within the dump. The 0 means you’re looking at the context (really just the information) for CPU 0. (Most of the time, the kdb defaults to CPU0, but sometimes the kdb starts on the CPU# where the dump actually occurred.)
Now, at your kdb prompt, type in the stat command. You’ll be presented with output similar to this:
(0)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_6 machine
with 2 available CPU(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. lpar_name
release... 1
version... 7
… lines omitted …
time of crash: Mon Apr 13 09:52:09 2015
age of system: 13 day, 18 hr., 37 min., 28 sec.
xmalloc debug: enabled
FRRs active... 0
FRRs started.. 0
CRASH INFORMATION:
CPU -1 CSA 053A7E80
at time of crash, error code for LEDs: 70000000
This output contains key details. It tells you when your system crashed, which AIX version it was running and how long it had been up (the “age of system” field). It also gives you an LED code (70000000 in this example) that mirrors the LED display on the outside of your Power Systems box. A code of 70000000 indicates a program interrupt.
Now for the most telling command in this initial dump run-through. From your kdb prompt, enter “status”:
(0)> status
CPU     INTR      TID  TSLOT     PID  PSLOT  PROC_NAME
  0           15000D9    336  6A006A    106  sysdumpstart
  1            1B0037     27   F001E     15  wait
2-3 Disabled
In this output, you’ll find every logical CPU in your system listed with what was running on it at the time of the crash. Most of your logical processors (LPs) will have been running the wait process. Where you see anything other than wait, that is the program on which that particular LP halted. And guess what? It’s very likely that this program caused your crash. In this example, LP 0 halted on the sysdumpstart action, which is indicative of a user-initiated system dump. And indeed, I forced a dump of this LPAR for this demonstration output.
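On an LPAR with dozens of logical processors, picking the non-wait rows out by eye gets tedious. The sketch below filters them; non_wait_cpus is a hypothetical helper, and it assumes each CPU row arrives on a single line with the process name as the last field, as kdb prints it on a wide terminal.

```shell
# Read kdb `status` output on stdin and print "<cpu> <process>" for every
# logical CPU that was running something other than the wait process.
non_wait_cpus() {
    awk '
        /PROC_NAME/ { in_table = 1; next }   # header row marks the table start
        in_table && $1 ~ /^[0-9]+$/ && NF >= 6 && $NF != "wait" {
            print $1, $NF
        }
    '
}
```

You can paste the rows into the function from a saved session. Feeding it from a pipe, as in echo status | kdb vmcore.0 vmunix.0 | non_wait_cpus, should also work, assuming your kdb accepts subcommands on standard input.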
Congratulations! You’ve just read and diagnosed your first system dump. About the only caution I have in this simple procedure is that the status command may not give you a complete picture of the many factors that may have caused your system to dump. In addition to the system halting on a particular program, there may be other factors―e.g., corrupted kernel extensions, poorly designed code or even a firmware problem―that contribute to a crash. In these cases, it’s best to seek the advice of IBM Support so they can take a fine-toothed comb to your dump. That said, in my 15-plus years of dump analysis, the method I just outlined has shown me the true cause of a dump in most cases. With this information in hand, you can apply any necessary upgrades, code releases or operating system updates to keep the system from crashing again.
Get to Know the kdb
While we’re on the subject of the kdb, let’s look at some of the additional information it can reveal when you use it to read a dump. If you elect to seek help from IBM Support, it may be worthwhile to familiarize yourself with some of the following commands, because IBM will likely ask you to run them, or run them itself. First, do a stack trace of the functions that were being invoked when your system dumped. Call up the dump against your saved kernel with the kdb, enter “f” at the (0)> prompt and hit the Return key. You’ll get output that looks like this:
(0)> f
pvthread+003A00 STACK:
[00187D6C]safe_read_excp+000000 (00000000000000FF, F10001001066FCA0, 0000000000000001, 0000000000000002, 0000000000000000, F00000002FF47600 [??])
[00070458]isinmem+0000C0 (??, ??)
[00069FB0]wr_cdtu+0008C8 (??, ??)
[0006B128]idmp_do+000640 (??)
[002DE490]newstack+000020 ()
[kdb_get_memory] no real storage @ F00000002FF46EB0
In the context of the dump I forced on my LPAR, all this says is that a memory exception was encountered, which contributed to the crash. But let’s say you have an LPAR running an in-house application. Especially in the case of home-grown code, stack traces become essential for tracking down faults that could make a system unstable. With major application vendors like SAP or Epic, it’s doubtful you’ll run into code issues severe enough to crash a system. But for small vendors or onsite programmers who have code in production, a stack trace in a system dump becomes very important.
How about getting a list of the extensions that were loaded at the time of a crash? This is especially helpful when a dump involves a function like an I/O operation or memory allocation. The number and types of extensions that were loaded (and, perhaps most importantly, the order in which they were loaded) can go a long way toward adding depth to your dump analysis. At your (0)> prompt, start by entering:
lle -k | more
This command displays the kernel extensions that were loaded at the time of the crash. (I’ve omitted some lines to consolidate the display.)
    ADDRESS  FILE     FILESIZE FLAGS    MODULE NAME
  1 03F3DD00 042A5000 0000AFB8 00080252 /usr/lib/drivers/random(random64)
  3 03F3DB00 044F8000 001C4808 00080252 /usr/lib/drivers/nfs.ext(nfs.ext64)
  4 03F3DC00 044E4000 00001690 00180248 /unix
  5 03F3D900 03F73000 00000430 00080252 /usr/lib/drivers/nfs_kdes.ext(nfs_kdes_null.ext64)
  7 03F3D800 044C1000 00000398 00080242 /usr/lib/drivers/syscalls64.ext(syscalls64.ext64)
  8 03F3D600 04290000 00003AA8 00080242 /usr/lib/drivers/aiopin(aiopin64)
Are you seeing an extension in this list that you weren’t expecting? Or perhaps an extension loaded well before or after it was supposed to? The kdb makes it easy to answer these questions. And it’s not just kernel extensions that can be displayed; information for shared libraries and specific processes or memory addresses can also be shown. In addition, the lle command has a verbose mode that gives you and your developers highly detailed information about what those extensions were doing.
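One way to answer the “was I expecting this extension?” question is to compare the list against a file of modules you know belong on the system. This is only a sketch: loaded_modules and unexpected_modules are hypothetical helpers, the allowlist file is something you maintain yourself, and the code assumes the module name is the last field on its line and begins with a slash.

```shell
# Read lle-style output on stdin and print each module name,
# taken as the last field of any line where that field starts with "/".
loaded_modules() {
    awk '$NF ~ /^\// { print $NF }'
}

# Print loaded modules that do not appear in the allowlist file
# (one module name per line, matched as whole fixed strings).
unexpected_modules() {
    allowlist=$1
    # grep exits 1 when nothing is left over; treat that as success.
    loaded_modules | grep -F -x -v -f "$allowlist" || [ $? -eq 1 ]
}
```

Save the extension list to a file, then run unexpected_modules against it with your allowlist; any line it prints deserves a closer look.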
For More Information
Earlier in this article, I said that the stat command reveals an LED code that points you to the general cause of a crash. These codes provide the big picture, but what if you want more information? Enter the Machine State Save Area (mst) command. Among other things, mst displays a wealth of information on your hardware. But when diagnosing a system dump, the few lines at the bottom of the mst output are always quite telling:
(0)> mst
… lines omitted …
Except :
orgea 0000000013FFFE30 dsisr 0000000040000000 bit set: DSISR_PFT
vmh 000000FF07FFFFFD curea 0000000013FFFE30 pftyp 0000000000000106
Aha! These last few lines always start with a section headed by “Except”. As you might expect, Except is short for exception, and these lines elaborate on the LED code. In this output, notice the bit set field, which contains the declaration DSISR_PFT. If you get this far in reading your dump, always look at this one field; it will be the exact function your system crashed on. Remember: status shows you the program that caused the crash; bit set shows you the function that was invoked on behalf of the program. In this case, my manual invocation of the dump function caused a data storage interrupt in my page frame table. The system couldn’t deal with that interrupt, so it crashed.
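If you save a kdb session to a file, the two values this article leans on most, the LED code from stat and the bit set field from mst, can be pulled out in one pass. This is a sketch only: crash_summary is a hypothetical helper, and the led= and bits= labels are my own. It assumes the field labels appear exactly as in the output shown here.

```shell
# Read saved kdb output (stat and mst) on stdin and print the LED code
# and the exception bits, one per line.
crash_summary() {
    awk '
        /error code for LEDs:/ {
            sub(/.*error code for LEDs:[ \t]*/, "")
            if (NF) { print "led=" $1 } else { want_led = 1 }  # code may wrap
            next
        }
        want_led && NF { print "led=" $1; want_led = 0; next }
        /bit set:/ {
            sub(/.*bit set:[ \t]*/, "")
            print "bits=" $0
        }
    '
}
```

The function handles the LED code whether it sits on the same line as its label or wraps to the next one.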
The Kernel Debugger Tool
I could write a book on the kernel debugger tool, but what I’ve said here is plenty for you to get started. I just want to give you an idea of the power and depth you have at your fingertips when you set about diagnosing the causes of a system crash. It’s absolutely worth your while to get to know the kdb well.
Additional kdb Information
Need more info on all the commands the kdb contains? Just enter a question mark (?) at your (0)> prompt and hit the Enter key. Every command available in the version of AIX you’re running will be listed. Also visit the IBM Knowledge Center, which has a section devoted to the kdb. It’s basically in manpage format, but it will give you insight into the kdb’s many commands. The kdb is really the only game in town if you want to read an AIX system dump. Once you take a little time to get used to the way the kdb presents its information, you’ll be able to effectively diagnose system dumps.