Upgrading Your Storage Scale Cluster, Part 2
In part one of this two-part series, I discussed things you should be aware of prior to upgrading your Spectrum Scale cluster. In this article I will cover the actual upgrade process.
Preparing to Upgrade the Cluster From 5.1.7.1 to 5.2.1.1
The first step is to download the code and unpack it into a shared directory that is not in GPFS/Scale. I then take my backups and snaps. As I mentioned in part one, I take an AIX snap and a gpfs.snap prior to starting the upgrade. On every node I also take a mksysb backup to my backup server and clone rootvg using alt_disk_copy. The mksysb and the clone give me two different recovery methods if there are issues with the upgrade.
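As a rough sketch, the snap and backup commands look like this; the /backup path is just a placeholder for wherever your backup target is mounted, and gpfs.snap should be taken while the cluster is still up:
#/usr/lpp/mmfs/bin/gpfs.snap
#snap -ac
#mksysb -i /backup/$(hostname).mksysb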
Here is an example of the clone which is done on every node:
#lspv | grep root
hdisk0 00ce48c00c98baf1 rootvg active
hdisk1 00ce48c00cbd3762 altinst_rootvg
#exportvg altinst_rootvg
#alt_disk_copy -V -B -d hdisk1
The above clones rootvg to hdisk1. If I need to revert later, I can set the bootlist to hdisk1 and reboot. Alternatively, I can use nimadm to apply the upgrade to the alternate disk (hdisk1) instead of updating the live disk.
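If I do need to fall back to the clone later, the revert itself is just a bootlist change and a reboot. A minimal sketch (confirm the disk names with lspv first):
#bootlist -m normal hdisk1
#bootlist -m normal -o
#shutdown -r now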
Documenting the Cluster
The cluster I am upgrading has four AIX LPARs in it. All four LPARs have the same disks mapped to them using NPIV from the VIO servers. The primary LPAR and one filesystem manager are on one physical server, and both are quorum nodes. The second filesystem manager and the application server are on a second server, with that filesystem manager also being a quorum node. Although all the LUNs are mapped to each LPAR, the NSDs were changed from direct attach to network because the cluster used to contain a number of Linux nodes as well.
The cluster looks like this:
Server 1
gpfslpar1   Quorum node, runs apps and samba, FS manager
gpfslparp   Quorum node, FS manager
Server 2
gpfslpars   Quorum node, FS manager
gpfslpars1  Runs apps and samba
Prior to making any changes, I check various levels and cluster settings. I check the level of rpm, perl, python, ssh, ssl and java on every node, and I check samba on the SMB application nodes.
#oslevel -s
7300-02-02-2420
#lslpp -l | grep rpm
4.18.1.2003
#rpm -qa | grep rpm
rpm-python3.9-4.15.1-64_4.ppc
rpm-python3-4.15.1-64_4.ppc
AIX-rpm-7.3.2.3-73.ppc
#lslpp -l | grep perl
5.34.1.6
#lslpp -l | grep -i python
3.9.19.3
#lslpp -l | grep ssl
3.0.13.1000
#lslpp -l | grep ssh
9.2.112.2400
#lslpp -l | grep ava
8.0.0.826
#rpm -qa | grep samba
samba-client-4.18.9-1.ppc
samba-devel-4.18.9-1.ppc
samba-winbind-clients-4.18.9-1.ppc
samba-4.18.9-1.ppc
samba-common-4.18.9-1.ppc
samba-libs-4.18.9-1.ppc
samba-winbind-4.18.9-1.ppc
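Rather than logging in to each LPAR, the same level checks can be driven from one node over ssh. This is just a sketch using this cluster's node names; adjust the list and the grep patterns to taste:
for n in gpfslpar1 gpfslparp gpfslpars gpfslpars1
do
    echo "===== $n ====="
    ssh $n 'oslevel -s; lslpp -l | egrep "rpm|perl|python|ssl|ssh|ava"; rpm -qa | grep samba'
done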
After checking that these are up to date on all four nodes, I then document the actual cluster, starting with the level installed.
#lslpp -l | grep gpfs
gpfs.adv 5.1.7.1 COMMITTED GPFS Advanced Features
gpfs.base 5.1.7.1 COMMITTED GPFS File Manager
gpfs.compression 5.1.7.0 COMMITTED GPFS Compression Libraries
gpfs.crypto 5.1.7.1 COMMITTED GPFS Cryptographic Subsystem
gpfs.gskit 8.0.55.19 COMMITTED GPFS GSKit Cryptography
gpfs.license.adv 5.1.7.0 COMMITTED IBM Spectrum Scale Advanced
gpfs.msg.en_US 5.1.7.1 COMMITTED GPFS Server Messages - U.S.
gpfs.base 5.1.7.1 COMMITTED GPFS File Manager
gpfs.docs.data 5.1.7.1 COMMITTED GPFS Server Manpages
Now I document the licenses:
# mmlslicense -L
Node name Required license Designated license
---------------------------------------------------------------------
gpfslpar1.local server server
gpfslparp.local server server
gpfslpars.local server server
gpfslpars1.local client client
Summary information
---------------------
Number of nodes defined in the cluster: 4
Number of nodes with server license designation: 3
Number of nodes with FPO license designation: 0
Number of nodes with client license designation: 1
Number of nodes still requiring server license designation: 0
Number of nodes still requiring client license designation: 0
This node runs IBM Spectrum Scale Advanced Edition.
And then the actual cluster and configuration (mmlscluster and mmlsconfig):
# mmlscluster
GPFS cluster information
========================
GPFS cluster name: GPFSCL1.local
GPFS cluster id: 87671296340124043
GPFS UID domain: GPFSCL1.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type: CCR
Node Daemon node name IP address Admin node name Designation
-----------------------------------------------------------------
1 gpfslpar1.local 192.168.2.13 gpfslpar1.local quorum-manager
2 gpfslparp.local 192.168.2.21 gpfslparp.local quorum-manager
3 gpfslpars.local 192.168.2.22 gpfslpars.local quorum-manager
13 gpfslpars1.local 192.168.2.23 gpfslpars1.local
#mmlsconfig
Configuration data for cluster GPFSCL1.local:
-----------------------------------------------------
clusterName GPFSCL1.local
clusterId 87671296340124043
autoload yes
dmapiFileHandleSize 32
ccrEnabled yes
cipherList AUTHONLY
seqDiscardThreshhold 4G
prefetchPct 40
workerThreads 1536
maxFilesToCache 50000
maxStatCache 20000
maxblocksize 2048k
pagepool 96G
maxMBpS 12800
minReleaseLevel 5.1.7.0
adminMode central
File systems in cluster GPFSCL1.local:
--------------------------------------
/dev/gpfsdata
/dev/gpfsf1
/dev/gpfsf2
/dev/gpfsf3
/dev/gpfsf4
/dev/gpfsf5
/dev/gpfsf6
/dev/gpfsf7
Cluster manager node: 192.168.2.21 (gpfslparp)
I always record the number of disks and NSDs for the cluster on each node and then run errpt to make sure there are no current errors.
#gpfslpar1: lspv | wc -l
736
#gpfslpar1: lspv | grep nsd | wc -l
405
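The errpt check is just the standard AIX error report; I scan the summary output and chase anything recent before going further:
#errpt | more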
Lastly, I check for path problems and run “df -g” to record all the mounted filesystems.
#lspath | grep -i miss
#lspath | grep -i ail
#lspath | grep -i efin
#df -g
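Since I will be comparing these numbers again after the reboot, I usually capture them into a file on each node. A sketch, with /tmp/preupgrade.$(hostname).txt as an arbitrary example path:
( oslevel -s
  lslpp -l | grep gpfs
  df -g
  lspv | wc -l
  lspv | grep nsd | wc -l
  lspath | grep -i miss ) > /tmp/preupgrade.$(hostname).txt 2>&1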
Now for the Upgrade Itself
On the two application nodes I shut down the applications and samba. Then on the primary node I check the cluster status, shut down the cluster and start the upgrade.
# mmgetstate -av
gpfslpar1: mmgetstate -av
Node number Node name GPFS state
-----------------------------------------
1 gpfslpar1 active
2 gpfslparp active
3 gpfslpars active
13 gpfslpars1 active
#mmshutdown -a
#mmgetstate -av
gpfslpar1: mmgetstate -av
Node number Node name GPFS state
-----------------------------------------
1 gpfslpar1 down
2 gpfslparp down
3 gpfslpars down
13 gpfslpars1 down
I double check that /usr/lpp/mmfs/bin is in my path in /etc/environment. This cluster has been around for a while, so it is there. Then I mount the filesystem (it is NFS exported from my NIM server) that contains the code I will be installing. The AIX install consists of two sets of filesets: the first takes the node to 5.2.1.0 and the second updates it to 5.2.1.1. For Linux there is a single package that goes straight to 5.2.1.1.
#mount /usr/local/software
#cd /usr/local/software/spectrumscale/gpfsv521/aix-install-5210
#smitty update_all
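smitty update_all is interactive; if you prefer to script this step, the classic installp equivalent is roughly the following (a sketch, run from the same directory, with -Y accepting the license agreements):
#installp -agXY -d . all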
A total of seven filesets were installed, bringing the node to 5.2.1.0:
#lslpp -l | grep gpfs
gpfs.adv 5.2.1.0 COMMITTED GPFS Advanced Features
gpfs.base 5.2.1.0 COMMITTED GPFS File Manager
gpfs.compression 5.2.1.0 COMMITTED GPFS Compression Libraries
gpfs.crypto 5.2.1.0 COMMITTED GPFS Cryptographic Subsystem
gpfs.gskit 8.0.55.19 COMMITTED GPFS GSKit Cryptography
gpfs.license.adv 5.2.1.0 COMMITTED IBM Spectrum Scale Advanced
gpfs.msg.en_US 5.2.1.0 COMMITTED GPFS Server Messages - U.S.
gpfs.base 5.2.1.0 COMMITTED GPFS File Manager
gpfs.docs.data 5.2.1.0 COMMITTED GPFS Server Manpages
Then I change into the 5.2.1.1 directory to apply those updates.
#cd /usr/local/software/spectrumscale/gpfsv521/aix-update-5211
#smitty update_all
This installed four filesets.
#lslpp -l | grep gpfs
gpfs.adv 5.2.1.1 COMMITTED GPFS Advanced Features
gpfs.base 5.2.1.1 COMMITTED GPFS File Manager
gpfs.compression 5.2.1.0 COMMITTED GPFS Compression Libraries
gpfs.crypto 5.2.1.1 COMMITTED GPFS Cryptographic Subsystem
gpfs.gskit 8.0.55.19 COMMITTED GPFS GSKit Cryptography
gpfs.license.adv 5.2.1.0 COMMITTED IBM Spectrum Scale Advanced
gpfs.msg.en_US 5.2.1.0 COMMITTED GPFS Server Messages - U.S.
gpfs.base 5.2.1.1 COMMITTED GPFS File Manager
gpfs.docs.data 5.2.1.1 COMMITTED GPFS Server Manpages
I always run updtvpkg when I update filesets, especially rpm.
#updtvpkg
The above process should be done on all the nodes. Once they are all at 5.2.1.1, I rewrite the boot image, reset the bootlist to hdisk0 and reboot the nodes.
#bootlist -m normal -o
#bosboot -a -d hdisk0
#bootlist -m normal hdisk0
#bootlist -m normal -o
#shutdown -r now
After the reboot, the nodes should all come up with the cluster active unless you have it set to be brought up manually. My cluster comes up automatically. I wait about five minutes and then check the status:
#mmgetstate
Node number Node name GPFS state
--------------------------------------
1 gpfslpar1 active
#mmgetstate -av
gpfslpar1: mmgetstate -av
Node number Node name GPFS state
-----------------------------------------
1 gpfslpar1 active
2 gpfslparp active
3 gpfslpars active
13 gpfslpars1 active
Post-Reboot Checks
At this point it is time to check that the cluster is functioning correctly. I go through the commands I entered earlier such as “df -g”, “lspv | wc -l” and “lspv | grep nsd | wc -l” and compare the results to those from before the changes. If the filesystems did not all mount, then you can issue the following command on the primary node:
#mmmount all
Then check the mounts with:
#gpfslpar1: mmlsmount all
File system gpfsdata is mounted on 3 nodes.
File system gpfsf1 is mounted on 3 nodes.
File system gpfsf2 is mounted on 4 nodes.
File system gpfsf3 is mounted on 4 nodes.
File system gpfsf4 is mounted on 3 nodes.
File system gpfsf5 is mounted on 3 nodes.
File system gpfsf6 is mounted on 3 nodes.
File system gpfsf7 is mounted on 3 nodes.
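I also rerun the capture I took before the upgrade into a second file and diff the two on each node. The gpfs fileset lines will obviously differ, but the disk counts, NSD counts, mounts and paths should not (again using the example /tmp paths from earlier):
( oslevel -s
  lslpp -l | grep gpfs
  df -g
  lspv | wc -l
  lspv | grep nsd | wc -l
  lspath | grep -i miss ) > /tmp/postupgrade.$(hostname).txt 2>&1
diff /tmp/preupgrade.$(hostname).txt /tmp/postupgrade.$(hostname).txt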
Other checks include:
gpfslpar1: mmgetstate -aLs
Node number Node name Quorum Nodes up Total nodes GPFS state Remarks
---------------------------------------------------------------
1 gpfslpar1 2 3 4 active quorum node
2 gpfslparp 2 3 4 active quorum node
3 gpfslpars 2 3 4 active quorum node
13 gpfslpars1 2 3 4 active
Summary information
---------------------
Number of nodes defined in the cluster: 4
Number of local nodes active in the cluster: 4
Number of remote nodes joined in this cluster: 0
Number of quorum nodes defined in the cluster: 3
Number of quorum nodes active in the cluster: 3
Quorum = 2, Quorum achieved
gpfslpar1: mmlsmgr
file system manager node
---------------- ------------------
gpfsdata 192.168.2.13 (gpfslpar1)
gpfsf1 192.168.2.21 (gpfslparp)
gpfsf3 192.168.2.21 (gpfslparp)
gpfsf4 192.168.2.21 (gpfslparp)
gpfsf6 192.168.2.21 (gpfslparp)
gpfsf2 192.168.2.22 (gpfslpars)
gpfsf5 192.168.2.22 (gpfslpars)
gpfsf7 192.168.2.22 (gpfslpars)
Cluster manager node: 192.168.2.21 (gpfslparp)
At this point you are ready to bring up and test your applications and protocol nodes to determine if the cluster is ready for use.
After the Upgrade
After the upgrade I normally take another mksysb backup plus a new gpfs.snap and AIX snap, in case I need them for IBM support. There are also a couple of further steps required to finalize the upgrade, but they are not easily reversed, so I usually wait a couple of weeks before doing them. There are two parts to this: 1) finalize the config and 2) finalize the filesystems.
To check the config level, run:
#mmlsconfig | grep Release
minReleaseLevel 5.1.7.0
The above means that only nodes installed at 5.1.7.0 and above can join the cluster and only commands and features that were available at 5.1.7.0 can be used. In order to activate the config at 5.2.1, you need to run mmchconfig. Prior to that you can revert by rebooting from the clone taken at the beginning or by shutting down the cluster, uninstalling gpfs and reinstalling the old version, then bringing the nodes back up. There are ways to revert after you run mmchconfig but it is much more challenging.
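Before committing the release level, I like to confirm that every node's daemon is actually running the new code rather than just having it installed. mmdiag, run on each node, reports the build the running daemon was started from:
#/usr/lpp/mmfs/bin/mmdiag --version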
#mmchconfig release=LATEST
#mmlsconfig | grep Release
minReleaseLevel 5.2.1.0
At this point only nodes running 5.2.1.* can join the cluster and the new features will be available.
The final step is to upgrade the filesystems to the latest metadata format changes. Once this is done the disk images can no longer be read by prior versions of Storage Scale. To revert you would have to recreate the filesystem from backup media. To perform the filesystem upgrade:
Check the current level.
#gpfslpar1: mmlsfs gpfsf4 | grep -i "ile system version"
-V 31.00 (5.1.7.0) File system version
#gpfslpar1: mmlsfs gpfsf5 | grep -i "ile system version"
-V 31.00 (5.1.7.0) Current file system version
17.00 (4.2.3.0) Original file system version
Unmount the filesystems and then change them.
#mmumount all
#mmchfs gpfsf4 -V full
#mmchfs gpfsf5 -V full
Mount the filesystems again.
#mmmount all
#gpfslpar1: mmlsfs gpfsf4 | grep -i "ile system version"
-V 35.00 (5.2.1.0) Current file system version
31.00 (5.1.7.0) Original file system version
#gpfslpar1: mmlsfs gpfsf5 | grep -i "ile system version"
-V 35.00 (5.2.1.0) Current file system version
17.00 (4.2.3.0) Original file system version
The filesystems now show that they are at 5.2.1. I have not run into this yet, but some new file system features may require more processing than the mmchfs -V command alone can handle. To fully activate such features, you must also run the mmmigratefs command in addition to mmchfs -V.
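I have not needed it on this cluster, but as a sketch the sequence would look like the following, using gpfsf5 and the fast extended attributes option (--fastea) purely as an example; the filesystem must be unmounted, and the option you need depends on the feature being enabled:
#mmumount gpfsf5 -a
#mmmigratefs gpfsf5 --fastea
#mmmount gpfsf5 -a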
Cleanup
If you don’t clean up afterwards, the old versions of Storage Scale can fill up /var. I normally keep the last version and delete the older ones on each node.
#cd /usr/lpp/mmfs
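Before deleting anything I check what is actually in there and how much space each old level is using, for example:
#ls -l /usr/lpp/mmfs
#du -sk /usr/lpp/mmfs/* | sort -n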
Remove old versions—in my case:
#rm -f -R 5.0.3.2
#rm -f -R 5.0.4.4
I kept 5.1.7.1 as it was the previous one and I kept 5.2.1.1 as it is current.
Summary
In this article I stepped through the process I used to upgrade my AIX Spectrum Scale cluster. Depending on the type and number of nodes (protocol, non-protocol, AIX, Linux, Windows, etc.), your process may differ, but this should give you an idea of the steps involved in upgrading a simple AIX cluster.
References
Storage Scale FAQ
https://www.ibm.com/docs/en/STXKQY/pdf/gpfsclustersfaq.pdf?cp
Storage Scale Snap
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=details-using-gpfssnap-command
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=reference-gpfssnap-command
AIX Snap
https://www.ibm.com/support/pages/working-ibm-aix-support-collecting-snap-data
Storage Scale 5.2.1.1 Readme
https://www.ibm.com/support/pages/node/7170420
Storage Scale Upgrading
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=upgrading
Supported Upgrade Paths
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=upgrading-storage-scale-supported-upgrade-paths
mmmigratefs command
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=reference-mmmigratefs-command
Reverting to previous Storage Scale Levels
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=rplss-reverting-previous-level-gpfs-when-you-have-not-issued-mmchconfig-releaselatest
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=rplss-reverting-previous-level-gpfs-when-you-have-issued-mmchconfig-releaselatest
Upgrading non-protocol Nodes
https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=upgrading-storage-scale-non-protocol-linux-nodes
What’s new in Storage Scale 5.2.0?
https://www.spectrumscaleug.org/wp-content/uploads/2024/07/SSUG24ISC-Whats-new-in-Storage-Scale-5.2.0.pdf