CPU Threading Efficiency: How to Improve L2/L3 Cache Hits
Executing a thread or task quickly isn't the same as executing it efficiently. The missing factor is CPU threading efficiency – specifically, better feeding the CPUcore.
Traditionally, there's been an overlooked aspect of gaining better performance: Improving L2/L3 cache hits.
For the most part, AIX performance specialists have focused on executing threads with the greatest possible performance. While this is obviously important, simply having threads execute quickly isn't enough, because this approach doesn’t consider the rate and latency of L2/L3 cache misses. To address this, attention must be paid to keeping L2/L3 cache content undiluted by configuring to maintain fewer virtual CPUs of different LPARs on a given CPU core. As short-hand, I refer to this as "CPU threading efficiency."
Relative to the count of CPU cores, a workload can exhibit fewer threads running strenuously, or many threads switching rapidly, or a varying blend of both. This series of articles will define these characteristics, highlight how the POWER8 core improves workload performance, discuss POWER8 SMT-1/2/4/8 mode considerations, and offer models of CPU core threading for efficiency, throughput and performance.
(Note: The intent with these articles isn't to examine the physical POWER8 CPU complex. Rather, the focus is narrowed to the characteristics of a single CPU core. To make a clear distinction, I use the term "CPUcore" to refer to one core on a POWER8 system.)
CPU Under-Threading―and a Bit About Over-Threading
CPU under-threading and over-threading are two distinct problems. Under-threading is wasteful, expensive and rarely justified. Over-threading is frustrating, expensive and difficult to recognize. While neither is likely to cause a system outage, both extremes add overhead to and degrade the potential productivity of the entire POWER8 system. (Of course, either can be warranted for certain circumstances and workload characteristics, but that discussion is beyond the scope of this article.)
Under-threading is a sustained state of maintaining too few executing threads across too many CPUcores of a given LPAR. It's wasteful of CPUcores that would otherwise be used by other LPARs. Under-threading is encouraged when enterprises purchase more software CPU licenses than are needed.
Over-threading is a sustained state of maintaining too many executing threads across too few CPUcores of a given LPAR. It's wasteful of CPU cycles because of the overwhelming overhead of too many threads concurrently loading/storing instructions and data (called load/store unit overhead or LSU overhead, addressed below). Over-threading is often induced by buying an insufficient number of software CPU licenses for too few CPUcores.
(As you can probably imagine, of these two scenarios, under-threading is the far more common problem. For this reason, this series of articles will have little more about over-threading, though I do intend to return to this topic in the future.)
A CPUcore can run only as efficiently as it is fed thread instructions and data from L2/L3 cache. This cannot be overstated. CPUcores can only execute threads whose instructions and data reside in L1 instruction cache (called L1-I) and L1 data cache (called L1-D). Thus a top priority is the load/store activity by which the L2/L3 cache feeds the L1 cache; these activities are managed by the CPUcore's LSU circuitry. Threads are instructions that generally process streams of data. Thread instructions must reside in CPUcore L1-I cache before being dispatched to the CPUcore's execution pipelines. The loading of thread instructions and data into CPUcore cache is LSU overhead; a thread is not dispatched for CPUcore execution until its instructions are L1-I ready. All threads incur LSU overhead, because all threads load instructions and data for processing, and store results.
The hypervisor owns and underlies every LPAR, translating everything going to Power Systems hardware – including LSU overhead. The hypervisor has a higher urgency for CPUcore attention than even the AIX root user. Relative to AIX user/kernel workloads, the hypervisor is so lightweight and efficient that performance specialists seldom think of it. That said, it's important to consider the hypervisor: giving it excess work is bad for overall efficiency; giving it less work is good for overall performance. And yes, we can give the hypervisor less to do. This is an innate benefit of optimal CPU threading.
Extending the Duration of Thread Performance
Let's now consider today’s trend of large-scale long-duration workloads (e.g., Big Data and analytics applications), because these workloads often process disproportionately substantial concurrent streams of data for hours or even days at a time. For such workloads using shared processor LPARs, pervasive CPU under-threading greatly encourages sharing CPUcores with other shared processor LPARs. This dilutes the hit ratio of instructions and data in the L2/L3 cache of all SPLPARs assigned to these shared CPUcores and could lead to erratic thread performance.
If you stop to consider your own environment, the previous sentence is either completely relevant or utterly nonsensical. Why? It's a matter of scale.
If your shared processor LPARs house light workloads of fewer or shorter-duration threads on small-scale POWER/AIX systems, you are easily running with exceptional performance and efficiency out of the box; thus, the above statement is nonsensical by your experience. However, if your shared processor LPARs house heavy workloads of many longer-duration threads on large-scale E850/E870/E880 enterprise-class systems, the above statement is highly relevant to your experience, and the information I'll share in this series of articles is written specifically for you. Again, the difference is the scale of workload and LPAR configuration.
As noted, CPU under-threading dilutes the hit ratio of instructions and data in the L2/L3 cache of its attending CPUcores. When virtual CPUs of different LPARs too often share the same CPUcores, they dilute the L2/L3 cache content of every other LPAR. In contrast, while a CPUcore is executing a thread, there is a duration of exceptional thread performance for as long as thread instructions and data are readily available. As we continue with this series I'll discuss several tactics that can improve CPU threading efficiency.
Most of us configure a count of virtual CPUs that is greater than the count of running threads on the runqueue. When all shared processor LPARs are configured with more virtual CPUs than executing threads on the runqueue, then all shared processor LPARs are configured for pervasive under-threading. This was commonly done with POWER5/POWER6 systems, but customers suffered no performance degradation with these earlier Power architectures. However, extending this practice to POWER7 and POWER8 isn't recommended. Check out the sidebar below for more information.
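As a rough sketch, the under-/over-threading definitions above reduce to comparing the count of configured virtual CPUs against the sustained depth of the runqueue. This toy Python classifier (the function name and its strict thresholds are mine, following the article's definitions; real diagnosis would use sustained AIX runqueue statistics, not a single sample) makes the rule explicit:

```python
def threading_balance(virtual_cpus: int, runnable_threads: int) -> str:
    """Classify an LPAR's sustained threading state.

    Under-threading: too few executing threads for too many virtual CPUs.
    Over-threading: too many executing threads for too few virtual CPUs.
    """
    if runnable_threads < virtual_cpus:
        return "under-threaded"
    if runnable_threads > virtual_cpus:
        return "over-threaded"
    return "balanced"

# A shared processor LPAR with 16 virtual CPUs but a sustained runqueue
# of only 6 threads exhibits the pervasive under-threading described above.
print(threading_balance(16, 6))   # under-threaded
print(threading_balance(8, 8))    # balanced
print(threading_balance(4, 30))   # over-threaded
```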
Today, the missing factor is CPU threading efficiency – specifically, better feeding the CPUcore. Executing a thread or task quickly isn't the same as executing it efficiently; performance and efficiency are two different traits. When a given single thread is genuinely executing on a CPUcore, it must have its instructions in L1-I and its data in L1-D – or it would not be executing. As long as instructions and data are in CPUcore L1/L2/L3 cache, the thread executes with the greatest possible performance.
Said another way: In POWER7/POWER8, only the CPUcore:L2 cache feeds instructions and data to the CPUcore:L1 cache. When the L2 cache doesn't have the needed instruction or data, the L3 cache is searched. If it's not found in L3 cache, the system has to fetch it from elsewhere (e.g., another core's L2/L3 cache, main memory, attached PCI devices). While the system is fetching an L2/L3 cache miss, how quickly is this CPUcore executing its single thread? Oops. It isn't executing this thread at all – it's stalled by the L2/L3 cache miss while the CPUcore:L1 cache waits to be fed.
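The lookup order just described can be sketched as a toy model: L1 is fed only from L2, an L2 miss falls through to L3, and an L3 miss forces a far more expensive fetch from elsewhere. The latencies below are illustrative cycle counts I've assumed for contrast, not measured POWER8 figures:

```python
# Assumed, illustrative latencies (cycles) - not measured POWER8 numbers.
LATENCY = {"L2": 12, "L3": 40, "elsewhere": 300}

def fetch_cost(line: str, l2: set, l3: set) -> int:
    """Cost to feed one cache line toward L1, per the lookup order above."""
    if line in l2:
        return LATENCY["L2"]
    if line in l3:
        return LATENCY["L3"]          # searched only after an L2 miss
    return LATENCY["elsewhere"]       # remote cache/memory: the thread stalls

l2_content, l3_content = {"a"}, {"a", "b"}
print(fetch_cost("a", l2_content, l3_content))  # 12  - L2 hit
print(fetch_cost("b", l2_content, l3_content))  # 40  - L2 miss, L3 hit
print(fetch_cost("z", l2_content, l3_content))  # 300 - miss both, fetch far away
```

Even with these made-up numbers, the point holds: a single L2/L3 miss costs as much as dozens of hits, which is why the duration of exceptional thread performance ends the moment the cache stops feeding the core.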
The rate and latency of L2/L3 cache misses interrupt the duration of exceptional thread performance. Improving efficiency simply means reducing L2/L3 cache misses to achieve longer durations of exceptional thread performance. In this fashion, efficiency extends performance for improved throughput – this is the missing perspective.
Tactics for Realizing Efficiency
CPU threading performance is not the same trait as CPU threading efficiency. Improving CPU threading efficiency aims to extend durations of CPU threading performance within an interval. In this fashion, efficiency extends performance for improved throughput. Efficiency here means merely ensuring the content of CPUcore:L2/L3 cache is kept as focused as possible. But, what are the tactics for realizing improved CPU threading efficiency? I'll discuss this in the next installment of this series.
Sidebar on Under-Threading
When customers configure too many virtual CPUs, the hypervisor by default assigns one thread to execute on one virtual CPU, which then executes on one CPUcore. This assignment of executing one thread:vCPU:CPUcore is called a dispatch of 1:1:1. Dispatch and assignment have almost the same meaning in this case.
Now remember: with under-threading, there are more virtual CPUs than there are threads on the runqueue. In other words, there are too few executing threads for too many virtual CPUs. And many IBM customers recreate this routinely, as established by POWER5/6 practices.
When dispatched at 1:1:1, an executing thread is the only thread executing on the CPUcore (called ST or SMT-1, for single-threaded mode). As such, the thread is symmetrically balanced between both "sides" of the CPUcore. What most don't know is that there are two symmetrical sets of execution pipelines in each CPUcore. When one thread is divided and executed between both "sub-cores" (my term) of the same CPUcore, this thread executes with the greatest possible performance for only as long as the L2/L3 cache can provide instructions and data. This is good for thread execution performance. But here's the paradox: when configured this way, the L2/L3 cache is constantly diluted by the virtual CPUs of other LPARs – and the resulting L2/L3 cache misses shorten the durations of exceptional thread execution performance. Moreover, the more pervasive and strenuous the under-threading of LPARs, the worse the dilution effect becomes. By focusing too harshly on single-threaded performance, we induce a greater rate of L2/L3 cache misses.
In other words, a given CPUcore serves one thread per virtual CPU per LPAR at any moment; but within a 10 msec dispatch interval, perhaps several virtual CPUs of several LPARs share the same CPUcore – constantly diluting the L2/L3 cache upon each switch of virtual CPU on this CPUcore. Each thread executes quickly while it executes, but that duration is quickly halted by a greater rate of L2/L3 cache misses.
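A small simulation can make the dilution effect concrete. This sketch stands in a simple LRU cache for a CPUcore's L2/L3 content; the capacity, working-set sizes, and four-LPAR round-robin schedule are assumptions chosen only to illustrate the mechanism, not POWER8 parameters:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU stand-in for a CPUcore's L2/L3 cache content."""
    def __init__(self, capacity: int):
        self.capacity, self.lines = capacity, OrderedDict()
        self.hits = self.accesses = 0
    def touch(self, line: str) -> None:
        self.accesses += 1
        if line in self.lines:
            self.hits += 1
            self.lines.move_to_end(line)      # mark as most recently used
        else:
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False) # evict the coldest line
    @property
    def hit_ratio(self) -> float:
        return self.hits / self.accesses

# Each LPAR's working set: 8 distinct cache lines, re-touched repeatedly.
def workload(lpar: int) -> list:
    return [f"lpar{lpar}-line{i}" for i in range(8)]

# Dedicated core: one LPAR's virtual CPU keeps re-touching its own lines.
dedicated = LRUCache(capacity=16)
for _ in range(100):
    for line in workload(0):
        dedicated.touch(line)

# Shared core: virtual CPUs of four LPARs take turns each dispatch
# interval, each refilling the cache and evicting the others' lines.
shared = LRUCache(capacity=16)
for _ in range(100):
    for lpar in range(4):
        for line in workload(lpar):
            shared.touch(line)

print(f"dedicated hit ratio: {dedicated.hit_ratio:.2f}")
print(f"shared    hit ratio: {shared.hit_ratio:.2f}")
```

With these (assumed) numbers, the dedicated core settles near a perfect hit ratio, while round-robin sharing evicts every LPAR's lines before they are re-touched, driving the hit ratio toward zero – precisely the dilution described above, where each virtual CPU switch refills the cache at the others' expense.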
Why bother with a focus on only exceptional thread performance that ceases too soon? We should also care about preserving/focusing/prefetching L2/L3 cache content as strenuously as possible for longer durations of exceptional thread execution performance – and not share the same CPUcore so often between disparate LPAR workloads (aka a habit of pervasive under-threading).
Again, in other words: Pervasive under-threading is self-defeating when the L2/L3 cache is overly diluted. And by merely configuring too many virtual CPUs per LPAR across virtually all LPARs, we recreate this scenario repeatedly and universally throughout the world-wide Power/AIX community. Yes, there is a duration of exceptional thread performance when/with a single thread per virtual CPU per CPUcore, but there is a constant/redundant workload of refilling the L2/L3 cache misses. This is the paradox.
Earl Jew is a certified expert (Level Two) IT Specialist and senior IT management consultant, IBM Power Systems and IBM Systems Storage.