The Art and Science of Performance: Firmware, History and More
Performance tuning is the greatest art in computer management. How to get peak performance out of your existing AIX systems. Part 1 of 4.
Over the past two years, I’ve shown you some advanced techniques for analyzing AIX system performance and fixing the performance issues you uncover. The response to these articles has been heartening; it seems many of you have put these techniques into practice, with beneficial results.
However, the reader emails I receive typically consist of two parts. The first part details the successful implementation of suggestions from my articles―and again, this is nice to hear. As for the second part, it generally takes the form of a question, like: “How can I develop a methodology to attack my performance problems from the time I discover them, through analysis and remediation?”
Without a doubt, performance tuning is the greatest art in computer management. But the sad truth is that there’s virtually no documentation that explains how to get peak performance out of your existing systems, from within your current hardware and software configurations. This article, the first in a series, is my attempt to rectify this.
We will, of course, confine ourselves to AIX system performance, but the principles I’ll delve into can be applied to almost any computer, because all computers are constructed the same way: they have a certain number of CPUs, along with some quantity of memory, and they’re supported by networking and storage devices. Even the new field of quantum computing follows this basic architectural model. The main differences come in the software that computers run: the operating system foremost, but also databases, applications, middleware and utilities.
These commonalities even apply to computer documentation. I studied a handful of system performance manuals for several different operating systems, not just AIX, and found that these docs do a reasonably good job of covering the basic commands used in performance analysis. Unfortunately, there’s next to nothing about intermediate techniques of performance analysis and remediation. Even more striking, there’s utterly nothing about advanced techniques. In short, what’s needed is a soup-to-nuts, holistic methodology of dealing with the performance issues we all encounter. And this is what I’ll attempt to provide.
In this series I’ll detail my method of attacking and fixing performance problems. I’ll share how I think about performance and provide a roadmap to get you from diagnosis through remediation. I’ll explain how I use many of the performance tools that are supplied for AIX―and then tell you how to use them in different ways. Ultimately, I’ll encourage you to develop your own perspective on these matters, using your unique roadmap to navigate your own environment.
One final thing before we dive in: Please give some thought to the title of this article, “The Art and Science of Performance.” For sure, the scientific method is prominent when it comes to observing and evaluating systems with less than desirable performance. You’ll take logical steps that lead to your goal. You’ll formulate a hypothesis as to what’s wrong and how to go about fixing it. You’ll then test your hypothesis by various means, altering your thinking on the subject as required until you resolve the problem.
Further, you’ll encounter inviolate rules when analyzing your problem, like a CPU’s timeslice or dispatch wheel. You’ll deal with finite resources like memory and learn the limits of performance tuning. You’ll follow, as you should, many rules of IBM Power Systems hardware.
But when it comes to attacking performance problems that impact your hardware and everything it runs, it’s essential that you understand this: in performance analysis and tuning there are no rules. Performance is science, but it’s also an art. Everything is open to interpretation. All that matters is what works in your particular situation.
The Importance of Firmware
So, where to begin? How about if I get into the philosophy and method of performance analysis and remediation for the four main subsystems I just mentioned: CPU, memory, networking and storage.
I will tell you all about this―but not just yet. First, I need to point out another thing that’s missing from those performance manuals I looked through. It’s a gap I’m at a loss to explain. It’s firmware. Whatever you call it―the BIOS, microcode or firmware―the labels all represent one thing: embedded software that allows you to maximize performance and stability with all your systems, be they Power Systems servers, mainframes or PCs. Firmware is important because it’s the area of systems management that can actually prevent many problems before they become performance issues.
Updating and maintaining current firmware levels in your AIX systems should be a routine part of every administrator’s job. But many things can thwart that routine, the biggest of which is downtime. Many of your sites are running applications that require 24x7 availability. Taking a system down for firmware maintenance is something management may not tolerate. The alternatives that allow you to conduct firmware maintenance without incurring downtime may not be available to you, either. You may not have Live Partition Mobility (LPM) set up to move your production LPARs from one frame to another while your application is live. You may not have a high availability mechanism in place to fail an LPAR over to another frame so that downtime is minimized.
The problem, especially for smaller shops, is that firmware simply isn’t high on the list of management concerns. If this is the case in your shop, you need to take action. You need to become an educator as well as an administrator and explain to those in charge that firmware maintenance is vitally important. If firmware isn’t properly installed on your Power/AIX system, all your insight and all your tuning efforts won’t do a bit of good.
Think of firmware as a suite of enablers. It allows you to utilize the full feature set of your hardware, and not just in performance. The latest firmware enhances system security, fixes hardware bugs (along with bugs present in prior firmware releases), and lets you expand the capabilities of your hardware beyond those present on the machine when it shipped from the factory. Even with all that, I believe firmware’s greatest value is found in the performance realm. I can illustrate this with one broad example: A few years ago, AIX underwent a major revision from version 6 to version 7. Of course, we’re all aware of this; each of us has probably done dozens, if not hundreds, of v6-to-v7 upgrades. But did you upgrade your firmware along with your operating system?
It goes without saying that a major AIX release introduces all manner of key performance updates in each subsystem: CPU, memory, networking and storage. But there’s a catch to this wonderfulness, and it’s a big one. At countless customer sites I’ve visited, I’ve found that down-level firmware was the sole reason that most, if not all, performance benefits of an AIX upgrade were going unrealized. And with some older Power hardware models, out-of-date firmware can actually destabilize your environment. That’s right: older firmware may not even support the latest version of your operating system.
Like my reader emails, these on-site conversations tend to take a familiar form:
Me: “Did you upgrade your firmware, preferably before you upgraded AIX?”
Team of flustered admins who can’t understand why their databases and applications are performing worse under the new AIX version: <Agonizingly long silence.>
As any builder will tell you, it’s about the foundation. Every house needs one. With Power/AIX or any other computing environment, the foundation is firmware. So make sure you have current firmware installed on your systems. You have to start there.
In AIX, it’s easy to determine the level of firmware you’re currently running. Just use the lsmcode command. As root, run “lsmcode -c” from a command prompt. Your output will look like this:
lpar # lsmcode -c
The current permanent system firmware image is AL730_146
The current temporary system firmware image is AL730_146
The system is currently booted from the temporary firmware image.
The above is the result of an lsmcode -c taken on a POWER7 system. You see three lines of output, telling you which side contains which level of firmware and from which side your system has been booted. By side I simply mean the firmware image. In Power/AIX, you have two images: the temporary and the permanent. The names are counter-intuitive. Usually, your system will be booted from the temporary side or image. The temporary side is the image you use for testing new firmware; this image can be rolled back if something doesn’t go as planned. Once you commit firmware to the permanent side, a rollback becomes far more difficult; you’ll most likely need an IBM SSR to come out for a difficult and costly repair.
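If you look after more than a handful of LPARs, you may prefer to collect this information with a script rather than by eye. The Python below is a minimal sketch, not an IBM-supplied tool. It assumes that lsmcode -c prints the same three lines shown in the sample output (the wording may differ on other hardware or firmware levels), and the names parse_lsmcode and read_lsmcode are hypothetical, chosen only for illustration.

```python
import re
import subprocess

# These patterns are written against the three sample lines shown earlier.
# Assumption: other systems use the same wording.
_IMAGE_RE = re.compile(r"current (permanent|temporary) system firmware image is (\S+)")
_BOOT_RE = re.compile(r"booted from the (permanent|temporary) firmware image")


def parse_lsmcode(text):
    """Return a dict with the permanent level, temporary level and booted side."""
    info = {}
    for line in text.splitlines():
        image = _IMAGE_RE.search(line)
        if image:
            # Key is the side name, value is the firmware level on that side.
            info[image.group(1)] = image.group(2)
            continue
        boot = _BOOT_RE.search(line)
        if boot:
            info["booted"] = boot.group(1)
    return info


def read_lsmcode():
    """Run lsmcode -c on the local LPAR (as root) and parse what it prints."""
    result = subprocess.run(
        ["lsmcode", "-c"], capture_output=True, text=True, check=True
    )
    return parse_lsmcode(result.stdout)


# The sample output from this article, used here so the sketch runs anywhere.
SAMPLE = """\
The current permanent system firmware image is AL730_146
The current temporary system firmware image is AL730_146
The system is currently booted from the temporary firmware image.
"""

if __name__ == "__main__":
    print(parse_lsmcode(SAMPLE))
```

Run against the sample, this prints {'permanent': 'AL730_146', 'temporary': 'AL730_146', 'booted': 'temporary'}. On a real LPAR you would call read_lsmcode() instead, and a result where the two levels differ, or where the booted side is permanent, is worth a closer look.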
Note that in the above example, this POWER7 system is running firmware version AL730_146, which came with the system when it was delivered in 2014. The most current version of this system’s firmware is AL730_154, which shipped in August 2017. Three years is a long time to go without a firmware upgrade. Now, I understand that, for a lot of customers, being current on firmware isn’t realistic. Sometimes management doesn’t want to be on the bleeding edge of a new firmware release. I’ve seen that quite a few times over the years. Even IBM recommends waiting 1-3 months after any particular firmware package is released before installing it; that’s enough time for them to spot any buggy code.
My recommendation: You should never be more than two service pack levels behind current firmware. So decide which version of firmware is best for your site, download it from IBM’s Fix Central and install it according to the instructions. Sometimes, you’ll also need to update your HMC code levels along with your system firmware. And remember that these days, not all firmware upgrades require an outage. So know what you need and what you’re getting, and don’t be afraid to involve IBM support, especially if you’re applying firmware for the first time.
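The two-levels rule is easy to check mechanically once you have the list of levels IBM has released for your machine. The following is a rough sketch under stated assumptions: you supply that list yourself, ordered oldest to newest (for example, copied from the Fix Central listing for your model), and the function names levels_behind and needs_update are hypothetical. Because the numeric suffixes of firmware levels may not be consecutive, the sketch counts positions in the list rather than subtracting numbers.

```python
def levels_behind(installed, released):
    """Count how many released levels are newer than the installed one.

    `released` must be ordered oldest to newest and must contain `installed`.
    """
    if installed not in released:
        raise ValueError(f"{installed} is not in the released list")
    return len(released) - 1 - released.index(installed)


def needs_update(installed, released, max_behind=2):
    """Apply the rule of thumb: more than `max_behind` levels back means update."""
    return levels_behind(installed, released) > max_behind


if __name__ == "__main__":
    # Placeholder level names, invented for illustration only.
    # Substitute the real list for your own system.
    released = ["XX000_010", "XX000_020", "XX000_030", "XX000_040"]
    for level in released:
        print(level, levels_behind(level, released), needs_update(level, released))
```

With the placeholder list above, only the oldest entry is three levels back and so gets flagged. The default threshold of two is simply my rule of thumb from this article; pass a different max_behind if your site has its own policy.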
I could write a great deal more about firmware, but instead I’ll urge you to check out this IBM webpage. It succinctly describes practically everything you need to know about the proper installation of firmware for your systems. Please read this. Better yet, commit this information to memory.
That’s it for firmware―sort of. What I just described is system firmware. But there’s also device firmware. Firmware isn’t just for computers. Many of the devices that attach to your systems also have it, including storage adapters, network cards, and specialty devices like graphics processors, sound manipulators and artificial speech devices. All this firmware must be kept current as well.
If all the devices connected to your system are IBM products, great. You can get all of your device firmware at Fix Central. Of course, even the staunchest IBM shops generally have products from other manufacturers. So if your storage is from a vendor like EMC or Hitachi, you’ll need to get your device firmware from them. Same thing for network adapters and those high-end graphics processors. Just get into the habit of updating your device firmware at least as frequently as your system firmware. Again, this is your foundation.
The Art of the History
In medicine, no diagnosis can be made until the physician fully understands the conditions that led to a patient complaint. You say, “Doc, I got a pain right here,” and you point. But your doctor isn’t just going to nod and write you out a prescription for a pain-killer. He or she will ask a number of probing questions before you stretch out for an examination.
Machine diagnosis is similar to human diagnosis, in that they follow the same logic. And while both are firmly grounded in anatomical or computer science, developing a holistic picture of the condition of our computer patients requires an artful and individualistic approach. So like the physician, your examination of any performance problem must begin by compiling a detailed history.
Very rarely will you get the whole story of a performance problem with one phone call or email. Usually it’s a team effort, with application specialists, database administrators and developers all contributing their opinions. Of course, some problems are extremely urgent: performance issues that impact the bottom line must be resolved quickly. Nonetheless, you still need a thorough history of the problem at hand. This step can be rushed, but it can’t be skipped.
The typical chain of events is that you’ll get a call, an email, or perhaps a visit from a colleague saying “things are slow.” Describing performance issues completely and effectively is a challenge even for techies, so until you fully understand the problem, keep your mouth shut. A common mistake among neophyte performance practitioners is to hear the first few words describing a performance problem and jump the gun. I was guilty of this when I started out nearly 20 years ago. I didn’t listen before I spoke, and in my first year my mistakes caused a pair of crashed databases and one frozen application; lots of people looked daggers at me in meetings because my rushed diagnoses ground their business to a halt. Believe me, you don’t want to go through those types of experiences.
So listen and digest that first report, and on top of that, take copious notes. Write any questions you have in your notes and save them until the person who is providing the description has finished. Only after you’ve asked, at least twice, “Is there anything else you can tell me about the problem?” should you move on to your own list of questions.
Once your questions have been answered, you’re ready to get a broader perspective on the issue. It’s time for a meeting or teleconference, one that should include specialists from every department in the enterprise. Is an application’s response time slowing down? Don’t just ask the app people. Make sure the database, network, storage and security folks are involved. It wouldn’t hurt to rope in some end users, either. Is the database unresponsive? Ditto. Have network transfers slowed or stopped altogether? Same deal.
You’ll be surprised how many problems emerge from an unlikely source. Personally, I can’t count the number of times I’ve investigated application or database slowness, only to discover that the security people installed a patch that brought everything to its knees.
As with your initial one-on-one, the best thing you can do until everyone else has had their say is listen. Take notes or even record the conversation. I carry a pocket recorder and use it as a digital secretary for these occasions.
Again, only―and I mean only―when your understanding of the situation is as complete as it can be, should you ask your questions. Ideally, you’ll have many. Here are some old standbys:
- When did the performance problem start?
- Were any modifications made to the system around the time the problem started?
- Does the problem only occur at a certain time of day or when the workload is performing a particular task?
- Is the problem reproducible?
These general questions can get the ball rolling on a good, technical explanation of the problem. Each question you pose should lead to the next question, and ultimately, to your plan for remediation.
And while you’re at it, make sure you have both a prtconf and a config.sum from PerfPMR in front of you. That way you can begin to create a mental image of the problem and the system it’s running on.
Finally, whatever you do, don’t say you think you know the cause of a performance problem until you’re reasonably certain you are correct. And don’t offer a fix: that will only come after you’ve done your own analysis―as extensive as time permits―and you have as many facts available to you as circumstances allow.
And that is the art of the history.
In the next installment in this series, I’ll take you through performance diagnoses, from config to stat to mon to trace. In the meantime, email me any questions you may have.