Upgrading Your Storage Scale Cluster, Part 2
In part one of this two-part series, I discussed things you should be aware of prior to upgrading your Spectrum Scale cluster. In this article I will cover the actual upgrade process.
Preparing to Upgrade the Cluster From 5.1.7.1 to 5.2.1.1
The first step is to download the code and unpack it into a shared directory that is not in GPFS/Scale. I then take my backups and snaps. As I mentioned in part one, I take an AIX snap and a gpfs.snap prior to starting the upgrade. On every node I also take a mksysb backup to my backup server and clone rootvg using alt_disk_copy. The mksysb and the clone give me two different recovery methods if there are issues with the upgrade.
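As a rough sketch, the snap and backup commands look like this; the /backup path is just a placeholder for wherever your backup target is mounted, and gpfs.snap should be taken while the cluster is still up:
#/usr/lpp/mmfs/bin/gpfs.snap
#snap -ac
#mksysb -i /backup/$(hostname).mksysb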
Here is an example of the clone which is done on every node:
#lspv | grep root
hdisk0 00ce48c00c98baf1 rootvg active
hdisk1 00ce48c00cbd3762 altinst_rootvg
#exportvg altinst_rootvg
#alt_disk_copy -V -B -d hdisk1
The above clones rootvg to hdisk1. If I need to revert later, I can set the bootlist to hdisk1 and reboot. Alternatively, I can use nimadm to apply the upgrade to the alternate disk (hdisk1) instead of updating the live disk.
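If I do need to fall back to the clone later, the revert itself is just a bootlist change and a reboot. A minimal sketch (confirm the disk names with lspv first):
#bootlist -m normal hdisk1
#bootlist -m normal -o
#shutdown -r now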
Documenting the Cluster
The cluster I am upgrading has four AIX LPARs in it. All four LPARs have the same disks mapped to them using NPIV from the VIO servers. The primary LPAR and one filesystem manager are on one physical server, and both are quorum nodes. The second filesystem manager and the application server are on a second server, with that filesystem manager also being a quorum node. Although all the LUNs are mapped to each LPAR, the NSDs were changed from direct attach to network because the cluster used to contain a number of Linux nodes as well.
The cluster looks like this:
Server 1
gpfslpar1   Quorum node, runs apps and samba, FS manager
gpfslparp   Quorum node, FS manager
Server 2
gpfslpars   Quorum node, FS manager
gpfslpars1  Runs apps and samba
Prior to making any changes, I check various levels and cluster settings. I check the level of rpm, perl, python, ssh, ssl and java on every node, and I check samba on the SMB application nodes.
#oslevel -s
7300-02-02-2420
#lslpp -l | grep rpm
4.18.1.2003
#rpm -qa | grep rpm
rpm-python3.9-4.15.1-64_4.ppc
rpm-python3-4.15.1-64_4.ppc
AIX-rpm-7.3.2.3-73.ppc
#lslpp -l | grep perl
5.34.1.6
#lslpp -l | grep -i python
3.9.19.3
#lslpp -l | grep ssl
3.0.13.1000
#lslpp -l | grep ssh
9.2.112.2400
#lslpp -l | grep ava
8.0.0.826
#rpm -qa | grep samba
samba-client-4.18.9-1.ppc
samba-devel-4.18.9-1.ppc
samba-winbind-clients-4.18.9-1.ppc
samba-4.18.9-1.ppc
samba-common-4.18.9-1.ppc
samba-libs-4.18.9-1.ppc
samba-winbind-4.18.9-1.ppc
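Rather than logging in to each LPAR, the same level checks can be driven from one node over ssh. This is just a sketch using this cluster's node names; adjust the list and the grep patterns to taste:
for n in gpfslpar1 gpfslparp gpfslpars gpfslpars1
do
    echo "===== $n ====="
    ssh $n 'oslevel -s; lslpp -l | egrep "rpm|perl|python|ssl|ssh|ava"; rpm -qa | grep samba'
done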
After checking that these are up to date on all four nodes, I then document the actual cluster, starting with the level installed.
#lslpp -l | grep gpfs
gpfs.adv 5.1.7.1 COMMITTED GPFS Advanced Features
gpfs.base 5.1.7.1 COMMITTED GPFS File Manager
gpfs.compression 5.1.7.0 COMMITTED GPFS Compression Libraries
gpfs.crypto 5.1.7.1 COMMITTED GPFS Cryptographic Subsystem
gpfs.gskit 8.0.55.19 COMMITTED GPFS GSKit Cryptography
gpfs.license.adv 5.1.7.0 COMMITTED IBM Spectrum Scale Advanced
gpfs.msg.en_US 5.1.7.1 COMMITTED GPFS Server Messages - U.S.
gpfs.base 5.1.7.1 COMMITTED GPFS File Manager
gpfs.docs.data 5.1.7.1 COMMITTED GPFS Server Manpages
Now I document the licenses:
# mmlslicense -L
Node name Required license Designated license
---------------------------------------------------------------------
gpfslpar1.local server server
gpfslparp.local server server
gpfslpars.local server server
gpfslpars1.local client client
Summary information
---------------------
Number of nodes defined in the cluster: 4
Number of nodes with server license designation: 3
Number of nodes with FPO license designation: 0
Number of nodes with client license designation: 1
Number of nodes still requiring server license designation: 0
Number of nodes still requiring client license designation: 0
This node runs IBM Spectrum Scale Advanced Edition.
And then the actual cluster and configuration (mmlscluster and mmlsconfig):
# mmlscluster
GPFS cluster information
========================
GPFS cluster name: GPFSCL1.local
GPFS cluster id: 87671296340124043
GPFS UID domain: GPFSCL1.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
Node Daemon node name IP address Admin node name Designation
-----------------------------------------------------------------
1 gpfslpar1.local 192.168.2.13 gpfslpar1.local quorum-manager
2 gpfslparp.local 192.168.2.21 gpfslparp.local quorum-manager
3 gpfslpars.local 192.168.2.22 gpfslpars.local quorum-manager
13 gpfslpars1.local 192.168.2.23 gpfslpars1.local
#mmlsconfig
Configuration data for cluster GPFSCL1.local:
-----------------------------------------------------
clusterName GPFSCL1.local
clusterId 87671296340124043
autoload yes
dmapiFileHandleSize 32
ccrEnabled yes
cipherList AUTHONLY
seqDiscardThreshhold 4G
prefetchPct 40
workerThreads 1536
maxFilesToCache 50000
maxStatCache 20000
maxblocksize 2048k
pagepool 96G
maxMBpS 12800
minReleaseLevel 5.1.7.0
adminMode central
File systems in cluster GPFSCL1.local:
--------------------------------------
/dev/gpfsdata
/dev/gpfsf1
/dev/gpfsf2
/dev/gpfsf3
/dev/gpfsf4
/dev/gpfsf5
/dev/gpfsf6
/dev/gpfsf7
Cluster manager node: 192.168.2.21 (gpfslparp)
I always record the number of disks and NSDs for the cluster on each node and then run errpt to make sure there are no current errors.
#gpfslpar1: lspv | wc -l
736
#gpfslpar1: lspv | grep nsd | wc -l
405
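The errpt check is just the standard AIX error report; I scan the summary output and chase anything recent before going further:
#errpt | more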
Lastly, I check for path problems and run “df -g” to record all the mounted filesystems.
#lspath | grep -i miss
#lspath | grep -i ail
#lspath | grep -i efin
#df -g
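Since I will be comparing these numbers again after the reboot, I usually capture them into a file on each node. A sketch, with /tmp/preupgrade.$(hostname).txt as an arbitrary example path:
( oslevel -s
  lslpp -l | grep gpfs
  df -g
  lspv | wc -l
  lspv | grep nsd | wc -l
  lspath | grep -i miss ) > /tmp/preupgrade.$(hostname).txt 2>&1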
Now for the Upgrade Itself
On the two application nodes I shut down the applications and samba. Then on the primary node I check the cluster status, shut down the cluster and start the upgrade.
# mmgetstate -av
gpfslpar1: mmgetstate -av
Node number Node name GPFS state
-----------------------------------------
1 gpfslpar1 active
2 gpfslparp active
3 gpfslpars active
13 gpfslpars1 active
#mmshutdown -a
#mmgetstate -av
gpfslpar1: mmgetstate -av
Node number Node name GPFS state
-----------------------------------------
1 gpfslpar1 down
2 gpfslparp down
3 gpfslpars down
13 gpfslpars1 down
I double check that /usr/lpp/mmfs/bin is in my path in /etc/environment. This cluster has been around for a while, so it is there. Then I mount the filesystem (it is NFS exported from my NIM server) that contains the code I will be installing. The AIX install consists of two sets of filesets: the first takes the node to 5.2.1.0 and the second updates it to 5.2.1.1. For Linux there is a single package that goes straight to 5.2.1.1.
#mount /usr/local/software
#cd /usr/local/software/spectrumscale/gpfsv521/aix-install-5210
#smitty update_all
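smitty update_all is interactive; if you prefer to script this step, the classic installp equivalent is roughly the following (a sketch, run from the same directory, with -Y accepting the license agreements):
#installp -agXY -d . all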
A total of seven filesets were installed, bringing the node to 5.2.1.0:
#lslpp -l | grep gpfs
gpfs.adv 5.2.1.0 COMMITTED GPFS Advanced Features
gpfs.base 5.2.1.0 COMMITTED GPFS File Manager
gpfs.compression 5.2.1.0 COMMITTED GPFS Compression Libraries
gpfs.crypto 5.2.1.0 COMMITTED GPFS Cryptographic Subsystem
gpfs.gskit 8.0.55.19 COMMITTED GPFS GSKit Cryptography
gpfs.license.adv 5.2.1.0 COMMITTED IBM Spectrum Scale Advanced
gpfs.msg.en_US 5.2.1.0 COMMITTED GPFS Server Messages - U.S.
gpfs.base 5.2.1.0 COMMITTED GPFS File Manager
gpfs.docs.data 5.2.1.0 COMMITTED GPFS Server Manpages
Then I change into the 5.2.1.1 directory to apply those updates.
#cd /usr/local/software/spectrumscale/gpfsv521/aix-update-5211
#smitty update_all
This installed four filesets.
#lslpp -l | grep gpfs
gpfs.adv 5.2.1.1 COMMITTED GPFS Advanced Features
gpfs.base 5.2.1.1 COMMITTED GPFS File Manager
gpfs.compression 5.2.1.0 COMMITTED GPFS Compression Libraries
gpfs.crypto 5.2.1.1 COMMITTED GPFS Cryptographic Subsystem
gpfs.gskit 8.0.55.19 COMMITTED GPFS GSKit Cryptography
gpfs.license.adv 5.2.1.0 COMMITTED IBM Spectrum Scale Advanced
gpfs.msg.en_US 5.2.1.0 COMMITTED GPFS Server Messages - U.S.
gpfs.base 5.2.1.1 COMMITTED GPFS File Manager
gpfs.docs.data 5.2.1.1 COMMITTED GPFS Server Manpages
I always run updtvpkg when I update filesets, especially rpm.
#updtvpkg
The above process should be done on all the nodes. Once they are all at 5.2.1.1, I rewrite the boot image, reset the bootlist to hdisk0 and reboot the nodes.
#bootlist -m normal -o
#bosboot -a -d hdisk0
#bootlist -m normal hdisk0
#bootlist -m normal -o
#shutdown -r now
After the reboot, the nodes should all come up with the cluster active unless you have it set to be brought up manually. My cluster comes up automatically. I wait about five minutes and then check the status:
#mmgetstate
Node number Node name GPFS state
--------------------------------------
1 gpfslpar1 active
#mmgetstate -av
gpfslpar1: mmgetstate -av
Node number Node name GPFS state
-----------------------------------------
1 gpfslpar1 active
2 gpfslparp active
3 gpfslpars active
13 gpfslpars1 active
Post-Reboot Checks
At this point it is time to check that the cluster is functioning correctly. I go through the commands I entered earlier such as “df -g”, “lspv | wc -l” and “lspv | grep nsd | wc -l” and compare the results to those from before the changes. If the filesystems did not all mount, then you can issue the following command on the primary node:
#mmmount all
Then check the mounts with:
#gpfslpar1: mmlsmount all
File system gpfsdata is mounted on 3 nodes.
File system gpfsf1 is mounted on 3 nodes.
File system gpfsf2 is mounted on 4 nodes.
File system gpfsf3 is mounted on 4 nodes.
File system gpfsf4 is mounted on 3 nodes.
File system gpfsf5 is mounted on 3 nodes.
File system gpfsf6 is mounted on 3 nodes.
File system gpfsf7 is mounted on 3 nodes.
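I also rerun the capture I took before the upgrade into a second file and diff the two on each node. The gpfs fileset lines will obviously differ, but the disk counts, NSD counts, mounts and paths should not (again using the example /tmp paths from earlier):
( oslevel -s
  lslpp -l | grep gpfs
  df -g
  lspv | wc -l
  lspv | grep nsd | wc -l
  lspath | grep -i miss ) > /tmp/postupgrade.$(hostname).txt 2>&1
diff /tmp/preupgrade.$(hostname).txt /tmp/postupgrade.$(hostname).txt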
Other checks include:
gpfslpar1: mmgetstate -aLs
Node number Node name Quorum Nodes up Total nodes GPFS state Remarks
---------------------------------------------------------------
1 gpfslpar1 2 3 4 active quorum node
2 gpfslparp 2 3 4 active quorum node
3 gpfslpars 2 3 4 active quorum node
13 gpfslpars1 2 3 4 active
Summary information
---------------------
Number of nodes defined in the cluster: 4
Number of local nodes active in the cluster: 4
Number of remote nodes joined in this cluster: 0
Number of quorum nodes defined in the cluster: 3
Number of quorum nodes active in the cluster: 3
Quorum = 2, Quorum achieved
gpfslpar1: mmlsmgr
file system manager node
---------------- ------------------
gpfsdata 192.168.2.13 (gpfslpar1)
gpfsf1 192.168.2.21 (gpfslparp)
gpfsf3 192.168.2.21 (gpfslparp)
gpfsf4 192.168.2.21 (gpfslparp)
gpfsf6 192.168.2.21 (gpfslparp)
gpfsf2 192.168.2.22 (gpfslpars)
gpfsf5 192.168.2.22 (gpfslpars)
gpfsf7 192.168.2.22 (gpfslpars)
Cluster manager node: 192.168.2.21 (gpfslparp)
At this point you are ready to bring up and test your applications and protocol nodes to determine if the cluster is ready for use.
After the Upgrade
After the upgrade I normally take another mksysb backup plus a new gpfs.snap and AIX snap, in case I need them for IBM support. There are also a couple of further steps required to finalize the upgrade, but they are not easily reversed, so I usually wait a couple of weeks before doing them. There are two parts to this: 1) finalize the config and 2) finalize the filesystems.
To check the config level, run:
#mmlsconfig | grep Release
minReleaseLevel 5.1.7.0
The above means that only nodes installed at 5.1.7.0 and above can join the cluster and only commands and features that were available at 5.1.7.0 can be used. In order to activate the config at 5.2.1, you need to run mmchconfig. Prior to that you can revert by rebooting from the clone taken at the beginning or by shutting down the cluster, uninstalling gpfs and reinstalling the old version, then bringing the nodes back up. There are ways to revert after you run mmchconfig but it is much more challenging.
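Before committing the release level, I like to confirm that every node's daemon is actually running the new code rather than just having it installed. mmdiag, run on each node, reports the build the running daemon was started from:
#/usr/lpp/mmfs/bin/mmdiag --version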
#mmchconfig release=LATEST
#mmlsconfig | grep Release
minReleaseLevel 5.2.1.0
At this point only nodes running 5.2.1.* can join the cluster and the new features will be available.
The final step is to upgrade the filesystems to the latest metadata format changes. Once this is done the disk images can no longer be read by prior versions of Storage Scale. To revert you would have to recreate the filesystem from backup media. To perform the filesystem upgrade:
Check the current level.
#gpfslpar1: mmlsfs gpfsf4 | grep -i "ile system version"
-V 31.00 (5.1.7.0) File system version
#gpfslpar1: mmlsfs gpfsf5 | grep -i "ile system version"
-V 31.00 (5.1.7.0) Current file system version
17.00 (4.2.3.0) Original file system version
Unmount the filesystems and then change them.
#mmumount all
#mmchfs gpfsf4 -V full
#mmchfs gpfsf5 -V full
Mount the filesystems again.
#mmmount all
#gpfslpar1: mmlsfs gpfsf4 | grep -i "ile system version"
-V 35.00 (5.2.1.0) Current file system version
31.00 (5.1.7.0) Original file system version
#gpfslpar1: mmlsfs gpfsf5 | grep -i "ile system version"
-V 35.00 (5.2.1.0) Current file system version
17.00 (4.2.3.0) Original file system version
The filesystems now show that they are at 5.2.1. I have not run into this yet, but some new file system features may require more processing than the mmchfs -V command alone can handle. To fully activate such features, you must also run the mmmigratefs command in addition to mmchfs -V.
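I have not needed it on this cluster, but as a sketch the sequence would look like the following, using gpfsf5 and the fast extended attributes option (--fastea) purely as an example; the filesystem must be unmounted, and the option you need depends on the feature being enabled:
#mmumount gpfsf5 -a
#mmmigratefs gpfsf5 --fastea
#mmmount gpfsf5 -a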
Cleanup
If you don’t clean up afterwards, the old versions of Storage Scale can fill up /var. I normally keep the last version and delete the older ones on each node.
#cd /usr/lpp/mmfs
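Before deleting anything I check what is actually in there and how much space each old level is using, for example:
#ls -l /usr/lpp/mmfs
#du -sk /usr/lpp/mmfs/* | sort -n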
Remove old versions—in my case:
#rm -f -R 5.0.3.2
#rm -f -R 5.0.4.4
I kept 5.1.7.1 as it was the previous one and I kept 5.2.1.1 as it is current.
Summary
In this article I stepped through the process I used to upgrade my AIX Spectrum Scale cluster. Depending on the type and number of nodes (protocol, non-protocol, AIX, Linux, Windows, etc.), your process may differ, but this should give you an idea of the steps involved in upgrading a simple AIX cluster.
References
Storage Scale FAQ
https://www.ibm.com/docs/en/STXKQY/pdf/gpfsclustersfaq.pdf?cp
Storage Scale Snap
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=details-using-gpfssnap-command
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=reference-gpfssnap-command
AIX Snap
https://www.ibm.com/support/pages/working-ibm-aix-support-collecting-snap-data
Storage Scale 5.2.1.1 Readme
https://www.ibm.com/support/pages/node/7170420
Storage Scale Upgrading
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=upgrading
Supported Upgrade Paths
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=upgrading-storage-scale-supported-upgrade-paths
mmmigratefs command
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=reference-mmmigratefs-command
Reverting to previous Storage Scale Levels
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=rplss-reverting-previous-level-gpfs-when-you-have-not-issued-mmchconfig-releaselatest
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=rplss-reverting-previous-level-gpfs-when-you-have-issued-mmchconfig-releaselatest
Upgrading non-protocol Nodes
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=upgrading-storage-scale-non-protocol-linux-nodes
What’s new in Storage Scale 5.2.0?
https://www.spectrumscaleug.org/wp-content/uploads/2024/07/SSUG24ISC-Whats-new-in-Storage-Scale-5.2.0.pdf