Slicing Time: Curing the Context Switch in AIX
Now that we’ve laid a solid groundwork for performance analysis, I'll show you how to tune your systems based on information you’ve gleaned from your diagnostic data.
Over the past year I’ve shown you how to use advanced tools to diagnose performance problems with AIX systems. We’ve covered topics like kernel tracing and all of the reports that can be extracted from that data, CURT for detailed CPU usage, and SPLAT for system lock analysis. We’ve also examined PerfPMR, IBM’s go-to utility for diagnosing system issues.
Now that we’ve laid a solid groundwork for performance analysis, let’s shift gears. In this and future articles, I’ll show you how to tune your systems based on information you’ve gleaned from your diagnostic data. Being able to diagnose AIX performance issues is important, but the next step is what really matters. Now you’re ready to flip the switches, turn the dials and actually fix these issues.
Context Switching in Context
Let’s start with a fairly common performance problem: frequent context switching. A context switch occurs when a CPU stops processing a working thread to process another thread. The CPU is changing its operating parameters (or its “context”) and prioritizing another program.
Some degree of context switching is normal in any computer operating system—not just AIX. In fact, it’s vital. Without context switching, systems couldn’t process more than one executable; you would run one thread and that would be it. Obviously a system that runs only one thread isn’t getting much work done, so you need a mechanism that tells a CPU to shift its attention from one thread to another at an appropriate time.
There are actually dozens of conditions that initiate a context switch, including self-blocking threads and threads that are blocked by still other threads. A mechanism called CPU Decay determines how long a thread may remain prioritized on a CPU. Programs have built-in timeouts that limit a thread’s CPU time. In some cases, a thread may have its CPU usage curtailed to allow other more important system activity to take place.
In short, context switching is fine, but it’s a matter of degree. If your system is doing nothing but context switching, then you have a serious performance problem.
So we must first determine the difference between normal and abnormal context switching. There are no hard and fast rules for this. It simply requires a thorough understanding of your system’s workload and how that workload uses logical CPUs. Beyond reading black and white performance statistics, you must develop an intuition about your system: is it operating correctly or isn’t it? And it isn’t enough to know whether your system is operating as it should. You need to know why, whichever way the answer falls.
Analyzing the Numbers
Many utilities can be used to examine context switching, including Topas, NMON, lparstat and a host of other diagnostic programs. I prefer vmstat, because it provides system CPU and memory usage in a single concise display. Plus, the most recent incarnations of vmstat draw on the Processor Utilization Resource Register (PURR), which gives you accurate statistics in an IBM Power Systems SMT environment.
If you start vmstat in a vanilla manner (say, by issuing “vmstat -w 2” at an AIX command prompt), you’ll be presented with a number of columns of data. For our discussion, we’ll focus on only one of these columns: cs, which stands for “context switch.” The cs column lies under the “faults” heading of your vmstat output, and its data will tell you whether the rate of context switching on your system is normal.
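As a rough sketch of how the cs column could be pulled out of vmstat output and averaged, the snippet below runs awk over a few sample lines. The numbers in the sample are invented for illustration, and the column layout is an assumption based on typical vmstat output; on a live system you would pipe the real vmstat output into the same awk program.

```shell
#!/bin/sh
# Sample vmstat lines. These values are made up for illustration only;
# on a real system, pipe "vmstat -w 2" into the awk program instead.
sample=$(cat <<'EOF'
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 1  0 812345 90321   0   0   0   0    0   0  25  410 230  1  1 98  0
 2  0 812350 90310   0   0   0   0    0   0  31  502 250  2  1 97  0
 1  0 812349 90305   0   0   0   0    0   0  28  455 240  1  1 98  0
EOF
)

# Find the "cs" column by name in the header row, then average that
# column over every numeric data row that follows.
avg_cs=$(printf '%s\n' "$sample" | awk '
    col == 0 { for (i = 1; i <= NF; i++) if ($i == "cs") col = i; next }
    $col ~ /^[0-9]+$/ { sum += $col; n++ }
    END { if (n > 0) printf "%d\n", sum / n }
')

echo "average cs per sample: $avg_cs"
```

Looking the column up by its header name, rather than hard-coding a position, keeps the sketch working if a particular vmstat flag adds or removes columns.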
But let’s back up a bit. To start our diagnosis, we need two pieces of critical data. The first is a consistent sampling rate for our vmstats, or a consistent number of seconds between each line of vmstat output. Taking one set of vmstats with a 10-second sampling rate and another with a 2-second sampling rate will completely invalidate our average context switch findings. Consistency is needed so that our up and down rates of context switching are in proportion from one set of statistics to another.
The next thing we need to know is how many logical CPUs we have in our system, and how many hardware threads we’re dealing with. Recall that every Power Systems physical CPU has a certain number of hardware threads to which software threads are mapped for execution. Each of these hardware threads is presented to AIX as a logical CPU: POWER5 and POWER6 CPUs have two hardware threads which are presented to AIX as two logical CPUs, POWER7 CPUs have four logical processors (LPs) and POWER8 CPUs have eight LPs. Working threads will be dispatched to run on logical processors.
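The relationship between cores, hardware threads and logical CPUs can be sketched in a few lines of shell. This assumes every core runs at the full SMT level listed above (a partition can be configured to run fewer threads per core), and the function names and the 3-core example are hypothetical.

```shell
#!/bin/sh
# Hardware threads per core at full SMT, as listed in the text above.
smt_threads() {
    case "$1" in
        POWER5|POWER6) echo 2 ;;
        POWER7)        echo 4 ;;
        POWER8)        echo 8 ;;
        *)             echo "unknown processor: $1" >&2; return 1 ;;
    esac
}

# logical CPUs = number of cores * hardware threads per core
logical_cpus() {
    threads=$(smt_threads "$1") || return 1
    echo $(( $2 * threads ))
}

lps=$(logical_cpus POWER7 3)   # hypothetical 3-core POWER7 partition
echo "logical CPUs: $lps"
```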
We take the number of LPs in our system and look at the total number of context switches as reported by vmstat. Let’s say we have a POWER7 system with 12 LPs. Let’s further assume we’re using a 2-second sampling rate for our vmstats to get a quick idea of the rate of context switching. In this scenario, we wouldn’t be surprised to see a total context switching rate somewhere in the low hundreds when the system is quiet. Even on a system at apparent rest, many internal processes are active, making our cs column non-zero.
Over time, you should watch vmstats and take careful note of application, database or other activity that begins to drive the cs counters upward. Then collect vmstats over an extended period (weeks or even months) using a consistent sampling rate to baseline context switching activity. Say we take vmstats for a month and find that we average 1,000 total context switches per sample while the system is under a heavy workload. Now we start our calculations:
Take the total number of context switches in a sample (1,000) and divide that number by the total number of LPs in our system (12). That gives us an average of about 83 context switches, per LP, per sample.
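The division above is easy to script so it can be repeated for each new baseline. This is a minimal sketch using the figures from the text; the variable names are only illustrative.

```shell
#!/bin/sh
total_cs=1000   # average context switches per sample (from the text)
lps=12          # logical processors in the system (from the text)

# Shell arithmetic is integer-only, so the result is truncated:
# 1000 / 12 = 83.33... becomes 83.
cs_per_lp=$(( total_cs / lps ))
echo "context switches per LP per sample: $cs_per_lp"
```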
(One more important backtrack: It’s not enough to have a general idea as to our workload and the number of logical CPUs in our system. We also need to know how that workload uses our CPUs. Remember that most single-threaded workloads—like databases—will use only the first, or primary, LP in any CPU, and when that LP is saturated, the workload will fall over to the next primary LP on the next CPU. The generic term for this type of CPU usage is “raw throughput.” Multi-threaded applications tend to use more LPs on each CPU before they cascade to the next CPU; this type of CPU usage is called “scaled throughput.” You can see that these workload characteristics are very important in determining your context switch/LP average. More on this in future articles.)
If our system’s performance is good with this number of context switches, we simply continue our statistics. But let’s say we see a sustained increase in context switching: our cs counters have increased from 1,000 to 10,000 every 2 seconds, and users are telling us that things are slowing down. We don’t see anything else in our performance data other than high cs rates, so what’s happening to our system’s performance and what can we do about it?
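One way to keep an eye on a jump like this is to compare each new cs reading against the baseline. The sketch below does that with the numbers from the text. The 5x threshold is purely hypothetical; as noted earlier, there are no hard and fast rules, so any cutoff has to come from your own knowledge of the workload.

```shell
#!/bin/sh
baseline_cs=1000   # long-term baseline per sample (from the text)
current_cs=10000   # current reading per sample (from the text)
threshold=5        # hypothetical cutoff: flag readings at or above 5x baseline

# Integer ratio of the current reading to the baseline.
ratio=$(( current_cs / baseline_cs ))

if [ "$ratio" -ge "$threshold" ]; then
    status="investigate"
else
    status="normal"
fi
echo "cs is ${ratio}x baseline: $status"
```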
The Timeslice Parameter
When a system’s cs rate is too high, working threads are bumped off CPUs before they’ve had time to complete their tasks, and the system’s logical CPUs rapidly turn their attention from one thread to another. Simply put, they change their context… much too quickly. We need to get back to a state where our working threads stay on an LP long enough to do everything we expect of them. Fortunately, a CPU tunable in AIX lets us adjust this time; it’s called “timeslice.”
The timeslice parameter sets how many clock ticks a working thread may stay on a CPU before the scheduler can take it off; in AIX, one clock tick lasts 10 milliseconds. If we raise the timeslice tunable’s value, we increase the number of 10-millisecond ticks a thread gets for each turn on a CPU. So if our working threads initially have one clock tick to run on a CPU, and we increase the number of clock ticks to 10, 100, 1,000, etc., our threads will have more time to run before they are taken off a logical CPU. We use the “schedo” command to determine how the timeslice value is set in our system. Schedo is short for “scheduling options” and is the suite of tunables that governs CPU performance in an AIX system. To view the timeslice value, use the “schedo -FL timeslice” command. You’ll see output like this:
lpar # schedo -FL timeslice
NAME        CUR   DEF   BOOT   MIN   MAX    UNIT          TYPE
timeslice   1     1     1      0     2G-1   clock ticks   D
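To get a feel for what different timeslice values mean in wall-clock terms, the sketch below converts clock ticks to milliseconds. It assumes one AIX clock tick is 10 milliseconds; the function name is hypothetical, and the result is the most a thread could run per turn, since a thread can still leave the CPU sooner for other reasons.

```shell
#!/bin/sh
tick_ms=10   # assumed length of one AIX clock tick, in milliseconds

# Convert a timeslice value (in clock ticks) to milliseconds.
timeslice_ms() {
    echo $(( $1 * tick_ms ))
}

for ticks in 1 10 100 1000; do
    echo "timeslice=$ticks -> up to $(timeslice_ms "$ticks") ms per turn on an LP"
done
```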
We start raising the timeslice parameter conservatively, say, in powers of 10. So we first increase our timeslice from the default of 1 to a value of 10 using the following command (adjustments to the timeslice parameter are dynamic; they take effect immediately, without a system reboot):
schedo -o timeslice=10
Now we return to our vmstats. Has our rate of total context switching gone down? If we’re not satisfied with our cs rate, we continue adjusting our timeslice tunable upward to 100, 1,000, 10,000, etc., until we see our cs rate fall off. As the timeslice gets bigger, our cs rate should get smaller. When we’re satisfied that our cs rate no longer hampers our system’s performance, we stop tuning the timeslice.
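The stepwise approach described above can be laid out as a small script. This sketch is a dry run: it only prints the schedo commands it would issue, so it is safe to read through on any machine. The run and tuning_ladder helper names are hypothetical, and the values are simply the powers of 10 mentioned in the text.

```shell
#!/bin/sh
# Dry run: print each command instead of executing it. Replace the echo
# with a real invocation only on a system you actually intend to change.
run() { echo "$@"; }

tuning_ladder() {
    for value in 10 100 1000 10000; do
        run schedo -o timeslice="$value"
        # After each step, re-check the cs column in vmstat using the same
        # sampling interval as the baseline before moving to the next value.
        # Stop as soon as the cs rate no longer hampers performance.
    done
}

tuning_ladder
```

In practice you would apply one step at a time and watch your vmstats in between rather than running the whole ladder at once. If a change makes things worse, the schedo -d timeslice command should return the tunable to its default, though it is worth confirming that against the schedo man page for your AIX level.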
Use Timeslice Wisely
As I said, there are many reasons a system’s cs rate may be too high. With many reasons come multiple cures, but we have to start someplace. Short of a code change, adjusting the timeslice is the quickest way to rid yourself of too-frequent context switches. Of course you need a complete picture of your system’s performance if you intend to adjust any CPU tunable, timeslice included; utilities you can run safely over extended periods (like the “stats” programs: vmstat, iostat and netstat) as well as snapshot diagnostic tools like PerfPMR will give you this picture.
A final caveat: Use caution if you’re adjusting timeslice in a production environment, and by no means should you raise timeslice if your system is already performing well. Timeslice isn’t a performance-enhancer. Its only proper use is for fixing very specific problems, like excessive context switching. It shouldn’t be adjusted for any other purpose.