IBM Power/NASA Collaboration: Seeking Details
Rob McNelly has questions about Power servers' involvement in the lunar mission, plus answers on updating VIOS from the HMC and tips from IBM Support
I recently received this interesting though brief post from the IBM Hybrid Cloud and Infrastructure LinkedIn feed. I found it light on details and want to know more:
“IBM Power servers ran at the core of Artemis II’s Launch Control System at NASA’s (National Aeronautics and Space Administration) Kennedy Space Center.
“The system processed and monitored hundreds of thousands of data points tied to telemetry, sensors and overall system health in the rocket and spacecraft. From early testing through the final T-10 countdown, IBM Power helped engineers quickly identify any data that moved outside expected ranges. That insight enabled confident decisions and helped keep launch activities moving forward.”
I’d certainly appreciate some detail. What server model and operating system were they running? My searches didn’t produce any definitive information. I did find this, which describes the spacecraft’s fail-silent architecture. And this comment thread cites PowerPC and theorizes about this being a new variant of the RAD750 architecture, but since the article talks about running at Kennedy Space Center, I imagine two different things are being discussed: one on the spacecraft, and one on the ground.
I’m just speculating at this point. If anyone has detailed information, please send it my way.
Updating VIOS Servers From the HMC
Recently I was in a situation where an IBM i admin wanted to update VIO servers from the HMC.
Since VIOS is AIX under the covers, AIX admins were called in to help. It’s difficult to describe what we were seeing, but basically the GUI displayed an animation of a blue line going back and forth across the screen. We were scratching our heads. Was anything actually happening? Did the browser hang? Were we 95% done? 5% done? Who knew?
Luckily we were able to ssh into the VIO server and run:
tail -f /home/padmin/install.log
This gave us some insight into what was happening, but I felt bad for any non-AIX/Unix admin who would be left with no choice but to rely solely on the feedback from the browser. Without the ability to check the logs, we would have been left with a very unsettled feeling.
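If you find yourself in the same spot, here is a rough sketch of the command-line route we took (assuming you can ssh to the VIOS; the hostname is just a placeholder):

ssh padmin@vios1                     # log in to the VIOS as padmin (replace vios1 with your server)
ioslevel                             # note the VIOS level before and after the update
oem_setup_env                        # drop from the restricted padmin shell into the root AIX shell
tail -f /home/padmin/install.log     # follow the update log while the HMC animation does its thing

Comparing the ioslevel output before and after is the quickest way to confirm the update actually took, whatever the browser is showing.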
IBM.SoftwareRM APAR and More From IBM Support
More recent alerts and updates from IBM Support:
* “High Impact/Highly Pervasive APAR IJ57286: IBM.SoftwareRM subsystem may kill all processes on an LPAR”
“The RSCT subsystem, IBM.SoftwareRM, has a memory leak and will run out of memory after running for some time. As the process approaches its memory limits, the error handling may result in unrelated processes on the LPAR getting killed. This can only occur on systems that are managed by an HMC connected to IBM’s Cloud Management Console (CMC), and that CMC is run as part of Power Enterprise Pools (PEP) 2.0. We have typically seen this after IBM.SoftwareRM has been running for 2-3 months without restart.”
A recent Chris Gibson email provides additional information:
“For customers who have SoftwareRM installed, but are not sure of their extended environment, you can verify your exposure by checking this command:
# /opt/rsct/bin/getRTAS | grep RMCLppInfo
If you get no output, then that LPAR is not at risk.
Results will be a string of assignments with the “RMCLppInfo” visible somewhere, like this:
HscHostName=___;RMCLppInfo=C:1,F:30;HscIPAddr=___;
-------------------
If the assignment is “C:0” then the LPAR is NOT at risk of this HIPER.
If the assignment is “C:1” then the LPAR is affected and needs this fix. (The “F:” value is irrelevant)”
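If you are in the affected configuration and cannot apply the fix right away, here is a minimal sketch of how you might watch the daemon's memory growth using standard SRC and ps tooling (the grep pattern is my assumption about the process name):

lssrc -s IBM.SoftwareRM                          # confirm the subsystem is active and get its PID
ps -eo pid,vsz,etime,args | grep -i SoftwareRM   # check how much virtual memory it holds and how long it has been up

Since the APAR says the problem typically shows up after two to three months of uptime, a long etime paired with a steadily climbing vsz is the thing to watch for.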
* shientdd driver network connection timeouts: “The shientdd driver logs SHAENT_SW_ERR and resets the adapter when it tries to transmit packets with largesend enabled and a small MSS going from the VIO client LPAR to the remote host.
Conclusion: “Using an ODM tunable we can either drop packets with an MSS size less than 64 bytes or adjust the MSS to the minimum value supported by the adapter.”
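The note doesn't name the tunable here, but the general pattern for this kind of ODM change on AIX looks like the sketch below; treat the attribute name and value as placeholders you would get from IBM Support:

lsattr -El entX                         # list the adapter's current ODM attributes (entX = the affected adapter)
chdev -l entX -a <tunable>=<value> -P   # stage the recommended setting in the ODM; -P defers it until the device is next reconfigured or rebooted

On a VIOS you would run these from oem_setup_env, or use the padmin lsdev/chdev -dev -attr equivalents.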
* For those involved with IBM i administration, this short but sweet tip might help:
“The 5250 console provided by the HMC (local or remote) does not support 132-column display. This is a restriction of the 5250 proxy function on the HMC.”
* This link to a readme suggests an HMC best practice that you might not be familiar with:
“User sessions: The following best practices help avoid both the gradual performance degradation caused by a growing number of login sessions and security exposures such as unauthorized access to active HMC sessions. Log off from the HMC UI before closing the browser tab rather than simply closing the tab. Set an idle session timeout for all users rather than leaving the timeout at ‘0’, which means no timeout.”
* PowerVM/VIOS: Stopping error log messages on unused fibrechannel ports.
“Usually when there are unused ports on FC adapters, it is possible to disable those ports on the adapters and stop cfgdev/cfgmgr from configuring the devices, and this will stop all the error log messages.
“Objective: Prevent “FCA_ERR12: 29FA8C20” and “FCA_ERR4: 7BFEEA1F” errors.
“The steps and information provided… are intended to disable ports that are not connected and are intended to remain unconnected. If the errors are logged against ports that should be connected or should be in the “Available” status, please troubleshoot those adapters accordingly.”
See the attached script.
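Before disabling anything, it is worth confirming which ports are really unused. A quick sketch using the error labels from the technote (fcs3 is just an example device name):

errpt -J FCA_ERR12,FCA_ERR4    # show only the FC error log entries the technote describes
lsdev -Cc adapter | grep fcs   # list the FC adapters on the partition
fcstat fcs3                    # inspect link statistics to confirm nothing is actually attached to the port

On the VIOS itself, the padmin equivalents are errlog, lsdev -type adapter and fcstat.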
* SEA Sharing mode not working after update to 4.1.2.0. Note: IBM registration is required to access this document.
“SEA with dedicated control channel VEA doesn’t negotiate the sharing mode after update to 4.1.2.”
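If you have already updated and want to check whether your SEAs actually negotiated load sharing, here is a rough sketch from the padmin shell (entX being your SEA device):

lsdev -dev entX -attr ha_mode        # "sharing" means load sharing is configured on this side
entstat -all entX | grep -i state    # PRIMARY_SH / BACKUP_SH indicate sharing was actually negotiated

If the state comes back as plain PRIMARY or BACKUP instead, the sharing mode was not negotiated, which lines up with the symptom the abstract describes.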