
Understanding and Tuning CPU Throughput

Few topics related to AIX performance analysis and tuning are more misunderstood than CPU throughput: how databases, applications, middleware and utilities use CPUs in an IBM Power Systems environment. What follows is a basic primer on determining CPU throughput and usage in your AIX systems. By combining this knowledge with performance statistics, you can tune your systems for the best performance.

Raw Versus Scaled Throughput

Power/AIX systems have two modes of CPU throughput: raw and scaled. Which mode is used in your systems at any given time depends on two factors: the default behavior of CPUs in the AIX environment, and how your executable is programmed for CPU usage.

Understanding the default behavior of CPUs in AIX requires a basic understanding of CPU architecture. Each Power Systems physical CPU presents two or more hardware threads, and each hardware thread is mapped to a logical processor on a virtual CPU. How many threads there are depends on the machine implementation type: each physical CPU in a POWER5 or POWER6 system has two hardware threads mapped to two logical processors on a virtual CPU, a POWER7 CPU has four hardware threads mapped to four LPs, and a POWER8 CPU has eight hardware threads mapped to eight LPs.

Each of the hardware threads on any given CPU is named in an order of precedence. For example, consider a POWER7 CPU. The first of the four hardware threads in a physical POWER7 CPU is called the primary thread; the other hardware threads are called sibling threads, collectively, and are further distinguished individually as the secondary, tertiary and quaternary threads. Alternatively, you can apply numbers to these hardware threads: 0-3 for the first POWER7 CPU in an AIX system, 4-7 for the second CPU, 8-11 for the third, and so on. Again, each of these hardware threads is mapped to a logical processor on a virtual CPU; it’s the virtual CPU to which working threads are bound. The Power Hypervisor then dispatches the virtual CPUs to run on physical CPUs where the threads actually do their work.
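You can see this mapping for yourself with the smtctl command, which ships with AIX and reports the SMT mode along with the logical processors bound to each virtual CPU. The output below is a trimmed illustration from an SMT4 LPAR; the exact wording varies by AIX level:

lpar(/)#smtctl

This system is SMT capable.
This system supports up to 4 SMT threads per processor.
SMT is currently enabled.

proc0 has 4 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0
Bind processor 2 is bound with proc0
Bind processor 3 is bound with proc0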

Okay, with me so far? The default behavior in AIX systems is for threads to utilize the primary hardware thread on any given CPU. When that hardware thread is saturated with work, the workload falls over to the primary thread on the next CPU; the sibling threads are left largely idle. This is called raw throughput mode. Most databases (Oracle, Sybase, Caché, etc.) use this scheme; they opt for the greatest throughput on any one CPU over better utilization of all the hardware threads on that CPU. As a result, most I/O operations (storage or network) complete faster in raw throughput mode.

Contrast this with the other mode of operation: the scaled throughput mode. With scaled throughput, both the primary and sibling threads of any given CPU are activated to do work. Only when all four hardware threads are saturated (referencing our previous sample POWER7 CPU) will the workload fall over to the next CPU. Many application vendors choose scaled throughput to handle multiple concurrent calculations.

So how do you determine whether your workload is using raw or scaled throughput? Use “mpstat.” It stands for “multi-processor statistics,” and not enough administrators know about it. It’s yet another of those extremely useful tools that gets overlooked even though it ships with every AIX distribution. While utilities like vmstat and iostat display aggregate CPU usage statistics, mpstat lets you evaluate the load on every logical processor in your system. With mpstat output, you can not only tell at a glance whether the workload in your system is doing raw or scaled CPU throughput, but also see the load each LP is under.

Let’s look at samples of mpstat from each type of workload. First, here’s an LPAR with a workload using raw throughput. (I’ve omitted some lines for brevity):

lpar(/)#mpstat -w 3

cpu  min  maj  mpc  int   cs  ics  rq  mig   lpa  sysc    us    sy   wa    id    pc  %ec  lcs
  0    5    0    0  206  129    1   1    0 100.0   185  60.0  40.0  0.0   0.0  0.00  0.0  160
  1    0    0    0   11    0    0   0    0     -     0   0.0   6.0  0.0  94.0  0.00  0.0   11
  2    0    0    0   11    0    0   0    0     -     0   0.0   4.6  0.0  95.4  0.00  0.0   11
  3    0    0    0   11    0    0   0    0     -     0   0.0   4.3  0.0  95.7  0.00  0.0   11

Focus on the us and sy columns. If you’ve read my previous articles, you know that “us” stands for “user time” and “sy” for “system time.” These refer to the amounts of time a logical processor spends doing work on behalf of applications and the AIX kernel, respectively. In this sample, the workload for four logical processors, numbered 0-3, is displayed. Note how LP 0 is getting almost all of the work, with user and system times of 60 and 40 percent, respectively. Now look at the other LPs. See how they’re almost completely idle? This is the typical default CPU behavior in AIX: raw throughput. Most of the time when you evaluate mpstat output, you’ll see this type of LP usage.

Now, let’s look at scaled throughput, again using mpstat output:

lpar(/)#mpstat -w 3

cpu  min  maj  mpc  int   cs  ics  rq  mig   lpa  sysc    us    sy   wa    id    pc  %ec
  0    5    0    0  206  129    1   1    0 100.0   185  30.0  10.0  0.0   0.0  0.00  0.0
  1    0    0    0   11    0    0   0    0     -     0  20.0   6.0  0.0  94.0  0.00  0.0
  2    0    0    0   11    0    0   0    0     -     0  15.0   4.6  0.0  95.4  0.00  0.0
  3    0    0    0   11    0    0   0    0     -     0  18.0   4.3  0.0  95.7  0.00  0.0

The differences should jump out at you. Again, focus on the us and sy columns. In this output, the workload is spread out over every logical processor in the system. The developers of the application that runs on this LPAR have sacrificed a bit of speed in favor of having many working threads (in this case, performing a type of numerical calculation) complete their activities in parallel. This is a good example of scaled throughput.

To see for yourself what type of CPU throughput your databases, applications, middleware and utilities are using on your own systems, use this invocation of mpstat:

nohup mpstat -w 60 11000 > lpar_name_date_started &

This is a typical invocation, and the one I generally use in my performance work. Of course, you can use the combination of mpstat flags that works best for you. Here, we set up mpstat to run continuously for about a week, taking a data sample once every minute. We write all of the mpstat data to a file, making sure we background the process with the “&” sign.
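If you run this on many LPARs, it helps to build the output filename automatically. Here’s one way to do it; the /tmp path and naming convention are just examples, so put the file wherever you keep performance data:

nohup mpstat -w 60 11000 > /tmp/$(hostname)_$(date +%Y%m%d).mpstat &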

Again, mpstat is invaluable as both a performance diagnostic aid and a capacity planning tool. Say you have a typical database LPAR with ten entitled (or physical) CPUs, and you need to set up several more instances of that database. Let’s also assume your LPAR is currently running at above 85 percent CPU capacity. You’re going to need more CPUs to handle your projected workload, right? But how many? mpstat is a big help in making this determination. Obviously, if your database runs in raw throughput mode (as noted, most do), you’ll need many more entitled CPUs than if it’s operating in scaled mode. Of course, every performance specialist knows what it’s like to go to management and explain why more processing power is needed when the current workload is using only 25 percent of its logical processors (sticking with our raw throughput example on POWER7 CPUs, where only one of each CPU’s four hardware threads carries the load). In this case, you really need to do your homework, and then do some basic teaching. Tell your higher-ups how AIX utilizes Power Systems CPUs, and include extensive mpstat studies of the subject workload. Hopefully your well-reasoned, fact-based projection of future CPU needs will be well received and acted upon.
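Those mpstat studies don’t have to be eyeballed line by line, either. Here’s a small sketch of how you might average the us and sy columns per logical processor with awk, assuming the default -w column layout shown in the samples above (us and sy as the 12th and 13th fields; check the header on your AIX level and adjust the field numbers if they differ):

awk '$1 ~ /^[0-9]+$/ { us[$1] += $12; sy[$1] += $13; n[$1]++ }
END { for (lp in n) printf("LP %s: avg us %.1f  avg sy %.1f\n", lp, us[lp]/n[lp], sy[lp]/n[lp]) }' lpar_name_date_started

If only LPs 0, 4, 8 and so on show meaningful user and system time (on SMT4 hardware), the workload is running raw; if the load is spread across the sibling LPs as well, it’s running scaled.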

Tuning Tips

In AIX, there are several CPU tunables that affect how a workload uses those CPUs. In most cases you shouldn’t need to touch these settings, but in some instances turning these knobs is essential. For instance, knowledge of raw and scaled CPU tuning is critically important if you’re building resource sets. RSETs are a way of segregating workloads in LPARs, sort of like building a virtual machine within a single system; I’ve covered them in a previous article.

There are other scenarios where knowing how to tune throughput is necessary. Suppose you’ve given management all the data you have that clearly spells out the need for a substantial investment in CPU entitlement, and their response is “no”: there isn’t room in the budget to buy more CPUs. So what do you do now? You get familiar with schedo and the raw and scaled throughput options it contains.

There are six sets of tuning switches in AIX: the vmo (memory), schedo (CPU), ioo (storage), no (network), nfso (NFS) and raso (kernel) tuning sets. It’s very likely that you have at one time or another experimented with some of the switches in these categories. Two of the schedo (“scheduler options”) tunables govern raw and scaled CPU throughput and fall within the scope of this article: vpm_throughput_mode and vpm_throughput_core_threshold. Detailed explanations follow:

* The vpm_throughput_mode tunable adjusts the level of SMT “exploitation” on each CPU. The default value is 0, which effectively disables this option; at the default, your LPAR conducts all of its business in raw throughput mode. Consider the scenario I just described, where you have no choice but to add workload to an LPAR without giving it more CPU horsepower. That’s a time when it may be necessary to tune vpm_throughput_mode. The higher you set vpm_throughput_mode, the more threads will be activated on each CPU before another full CPU is unfolded, or activated. Remember that raw throughput activates another physical CPU once only the primary thread on the first CPU becomes saturated; raising vpm_throughput_mode forces more threads (logical processors) to be utilized before the workload falls over to the next CPU. The maximum value for vpm_throughput_mode is 8, which corresponds to the number of hardware threads and logical processors in a POWER8 CPU. So the upshot is that you can start with your CPUs in raw throughput mode and gradually increase each CPU’s utilization until you’re running in fully scaled mode.

vpm_throughput_mode is a dynamic setting and can be adjusted on the fly like this:

schedo -o vpm_throughput_mode=X

Here, X is the desired level of SMT exploitation. (Note: A value of 1 still enables raw throughput mode, but a newer CPU folding algorithm is used.)
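Before you change anything, capture the current settings. schedo’s -o flag with no value displays a tunable’s current setting, and -L lists its current value, default and valid range. The output below is trimmed and the values are illustrative:

lpar(/)#schedo -o vpm_throughput_mode
vpm_throughput_mode = 0

lpar(/)#schedo -L vpm_throughput_mode
NAME                      CUR    DEF    BOOT   MIN    MAX    UNIT
vpm_throughput_mode       0      0      0      0      8      numeric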

* The vpm_throughput_core_threshold tunable sets the number of CPUs that will be unfolded, or activated, in raw throughput mode before the system switches to scaled throughput mode. Say you’ve set vpm_throughput_core_threshold to 10. Whenever the primary thread of a CPU gets saturated, the workload falls over to the next CPU, until ten CPUs have been unfolded. Once all ten primary threads are saturated, the CPUs switch to scaled throughput and work with the value you set with the previous tunable, vpm_throughput_mode.

These tunables work in tandem, so never adjust one without taking the other into consideration. And of course don’t adjust either unless you have a thorough understanding of your system’s workload and how it uses CPUs. My advice is to practice tuning these values on a sandbox system. These tunables are too valuable in too many instances not to use effectively, but of course my standard disclaimer applies: Under no circumstance should you adjust vpm_throughput_mode or vpm_throughput_core_threshold in production without complete validation and testing in a development or test environment with comparable workloads and configurations.
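As a sketch of what that sandbox practice might look like, here’s how you’d set both tunables in one call and then back them out. The -p flag applies a change to both the running system and the reboot values, and -d returns a tunable to its default. The values here are illustrative, not recommendations:

lpar(/)#schedo -p -o vpm_throughput_mode=2 -o vpm_throughput_core_threshold=10

lpar(/)#schedo -p -d vpm_throughput_mode -d vpm_throughput_core_threshold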

CPU Secrets Uncovered

As I said at the beginning, how workloads use CPUs in an AIX-based Power Systems environment is too often misunderstood; I’d wager that nearly every AIX admin has misunderstood it at some point. In truth, I worked on AIX systems for years before I learned these lessons. So I’ve been there.

Knowing how to effectively maximize your enterprise’s CPU investment will save your business a whole lot of money. But beyond that, tuning CPU usage is a fascinating topic. I hope I’ve piqued your interest, because I’m not done writing about this. Future articles will uncover many more CPU secrets!

Comparing Raw and Scaled Throughput

Raw Throughput Mode

  • Default behavior in AIX systems.
  • Only primary CPU threads are (fully) utilized.
  • Provides greatest throughput to optimize performance of I/O operations.

Scaled Throughput Mode

  • Primary and sibling CPU threads are utilized.
  • Only when all hardware threads are saturated does workload fall to the next available CPU.
  • Offers the greatest CPU utilization and parallelism.
