Patching Exalogic part 4a
Before we dive further into the brave new world of virtualization on the Exalogic, I thought I’d finish up my series of posts on patching. My previous posts on this subject detailed upgrading first the network components (Infiniband Gateway Switches, indicated in green) and second the storage component (ZFS 7320 appliance, in blue).
This post will deal with patching the operating system on the modified Sun Fire X4170 M2 servers (in red), dubbed compute nodes in Exalogic terminology. In our case the OS is Oracle Linux.
As before with the storage and network patching, I will demonstrate that the compute nodes can be patched in a rolling fashion, maintaining application availability during the upgrade, provided that your application is deployed in a redundant (HA) fashion, for example in a WebLogic cluster spread over more than one physical node.
As an example, I will take the installation of the Exalogic 184.108.40.206.1 patchset (patch 13569004). This is the quarterly upgrade from April 2012. After unpacking the patch and thoroughly examining the README and the corresponding upgrade advisor document on MOS, we can get to work. First we have to apply the network and storage patches for the infrastructure as I have described before, but that is not the subject of this post. After having done so, we can start on the compute nodes, which are our current focus.
You can patch one compute node at a time, or multiple nodes in parallel. The patch script facilitates the latter as well, through the use of the Exalogic’s Distributed Command Line interface (DCLI). It’s always prudent to patch one compute node (with some less critical deployments if possible) first as a test case and carefully evaluate the results before moving on. If you are patching in a rolling fashion you would probably not patch more than a few nodes at a time anyway.
So, basically, the rolling patch procedure should look something like this:
1. Do all preparatory steps that can be done without impacting running application services.
2. Stop application services in one half of your application cluster and make sure users are redirected/failed over to the other half of the cluster.
3. Upgrade the node(s) underlying the first half of your application cluster (that you just took out of service). Check if all went well.
4. Restart the application services on the freshly patched node(s) and check they have rejoined the cluster. Now take the other half of your cluster nodes out of service.
5. Upgrade the other half of your cluster in the same fashion and restart the application services on the second half, etc.
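The ordering in the steps above can be sketched as a small shell skeleton. This is only an illustration of the sequence, not part of the patch tooling: the four hook functions and the `rolling_patch` name are hypothetical placeholders that merely print what they would do, and you would replace them with your own site-specific commands.

```shell
#!/bin/bash
# Sketch of the rolling order described above. All four hooks are hypothetical
# placeholders that only print what they would do; replace them with your own
# site-specific commands.
stop_services()  { echo "stop services on $1"; }
patch_node()     { echo "patch $1"; }
verify_node()    { echo "verify $1"; }
start_services() { echo "start services on $1"; }

# Usage: rolling_patch "<batch 1 nodes>" "<batch 2 nodes>" ...
# Each argument is one batch: a space-separated list of node names.
rolling_patch() {
    local batch node
    for batch in "$@"; do
        # Take the whole batch out of service before touching any node in it.
        for node in $batch; do stop_services "$node" || return 1; done
        for node in $batch; do patch_node "$node" || return 1; done
        # Stop here if a node looks unhealthy, so the next batch stays in service.
        for node in $batch; do verify_node "$node" || return 1; done
        for node in $batch; do start_services "$node" || return 1; done
    done
}
```

Calling `rolling_patch "xxxxexacn08" "xxxxexacn07"` would walk through node 8 completely before node 7 is taken out of service, and a failing verification aborts the run before the second batch is touched.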
So, let’s try this out for ourselves now….
First we check whether we are presently on the correct version of the Exalogic Base Image; the minimum version for a 220.127.116.11.x patchset is of course the 18.104.22.168.0 release. We can check this by running the “imageinfo” command across all eight nodes in our quarter rack configuration. This is most easily done via a simple DCLI script:
[root@xxxxexa01 scripts]# cat check_imageinfo.scl
imageinfo | head -1
[root@xxxxexa01 scripts]# dcli -t -g allnodes.lst -x check_imageinfo.scl
Target nodes: ['xxxxexacn01', 'xxxxexacn02', 'xxxxexacn03', 'xxxxexacn04', 'xxxxexacn05', 'xxxxexacn06', 'xxxxexacn07', 'xxxxexacn08']
xxxxexacn01: Exalogic 22.214.171.124.0 (build:r213841)
xxxxexacn02: Exalogic 126.96.36.199.0 (build:r213841)
xxxxexacn03: Exalogic 188.8.131.52.0 (build:r213841)
xxxxexacn04: Exalogic 184.108.40.206.0 (build:r213841)
xxxxexacn05: Exalogic 220.127.116.11.0 (build:r213841)
xxxxexacn06: Exalogic 18.104.22.168.0 (build:r213841)
xxxxexacn07: Exalogic 22.214.171.124.0 (build:r213841)
xxxxexacn08: Exalogic 126.96.36.199.0 (build:r213841)
The allnodes.lst file contains a list with the nodes we want to check, which is all of them in this case. Now that we have verified that we are OK on current versions, we can proceed with the patching process.
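With eight nodes the output is easy to check by eye, but on a larger rack a small filter can help. The following is a minimal sketch, assuming the dcli output keeps the `<node>: Exalogic <version> (build:...)` layout shown above; the function name `check_image_versions` is my own and not part of DCLI.

```shell
#!/bin/bash
# Hypothetical helper (not part of DCLI): reads dcli output on stdin, with
# lines of the form "<node>: Exalogic <version> (build:<rev>)", and prints
# every node whose version differs from the expected one given as $1.
# Returns 0 when all nodes match, 1 otherwise.
check_image_versions() {
    local expected="$1"
    awk -v want="$expected" '
        $2 == "Exalogic" {
            node = $1
            sub(/:$/, "", node)            # strip the trailing colon
            if ($3 != want) {
                print node " is on " $3
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

You would pipe the dcli command from above into it, for example `dcli -t -g allnodes.lst -x check_imageinfo.scl | check_image_versions "<expected version>"`, and only proceed when it prints nothing.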
The README tells us how to put things in place before starting the actual patching process and shutting down the application services on the nodes, thus minimizing downtime for each (set of) node(s):
“Copy Base Image patch content from the patches share location to local disk on the compute node; create local node location if required. For example:”
Step 1. OK, let’s set this up for compute node 8, which we will patch first in our example. We clean up any previous patch files first, just to be sure.
[root@xxxxexa08 ~]# rm -rf /opt/baseimage_patch/*
[root@xxxxexa08 ~]# cp -R /u01/common/patches/todo/13569004/Infrastructure/188.8.131.52.1/BaseImage/184.108.40.206.1/* /opt/baseimage_patch/.
[root@xxxxexa08 ~]# cd /opt/baseimage_patch
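The README asks for the local directory to be created if required, whereas the commands above assume it already exists. A minimal sketch of a staging helper that covers both cases could look like this; `stage_patch` is a hypothetical name of my own, and the source and destination paths are passed in rather than hard-coded.

```shell
#!/bin/bash
# Hypothetical helper: copy the patch content from a share to local disk.
#   $1 = source directory on the patches share
#   $2 = local staging directory (created if missing, emptied if present)
stage_patch() {
    local src="$1" dest="$2"
    [ -d "$src" ] || { echo "source $src not found" >&2; return 1; }
    mkdir -p "$dest" || return 1
    # ${dest:?} aborts if dest is empty, so this can never expand to "rm -rf /*".
    rm -rf "${dest:?}"/*
    # "src/." copies the directory contents rather than the directory itself.
    cp -R "$src"/. "$dest"/
}
```

On node 8 this would be called with the BaseImage directory of the unpacked patch as the first argument and `/opt/baseimage_patch` as the second.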
Step 2. Now we stop all application services on node 8 and verify that users and processes have failed over to node 7, where the other half of our cluster resides.
Step 3. As a precaution and to save on downtime, we should unmount any filesystems mounted over NFS. We don’t want stray user processes blocking the unmount commands during the reboot later on, which would significantly slow down or even frustrate our patch job in the next step.
[root@xxxxexa08 baseimage_patch]# umount -avt nfs
mount: trying 192.168.10.30 prog 100005 vers 3 prot tcp port 51606
xxxxexasn-priv:/export/ExalogicDemo1/otd umounted
mount: trying 192.168.10.30 prog 100005 vers 3 prot tcp port 51606
xxxxexasn-priv:/export/ExalogicDemo1/oradata umounted
mount: trying 192.168.10.30 prog 100005 vers 3 prot tcp port 51606
...
...
mount: trying 192.168.10.30 prog 100005 vers 3 prot tcp port 51606
umount: /u01/products/Middleware11gPS3: device is busy
mount: trying 192.168.10.30 prog 100005 vers 3 prot tcp port 51606
xxxxexasn-priv:/export/common/patches umounted
mount: trying 192.168.10.30 prog 100005 vers 3 prot tcp port 51606
xxxxexasn-priv:/export/common/general umounted
Now check whether any NFS filesystems are still mounted… looks like we might have an issue!
[root@xxxxexa08 baseimage_patch]# mount -lt nfs
xxxxexasn-priv:/export/products/Middleware11gPS3 on /u01/products/Middleware11gPS3 type nfs (rw,bg,hard,nointr,rsize=131072,wsize=131072,tcp,nfsvers=3,addr=192.168.10.30)
xxxxexasn-priv:/export/ACSExalogicSystem/nodemgrs on /u01/ACSExalogicSystem/nodemgrs type nfs (rw,bg,hard,nointr,rsize=131072,wsize=131072,tcp,nfsvers=3,addr=192.168.10.30)
Oops, got it: I forgot to shut down the WebLogic nodemanager on this node! That can be fixed quickly enough. Shut down the nodemanager and retry:
[root@xxxxexa08 baseimage_patch]# umount -avt nfs
mount: trying 192.168.10.30 prog 100005 vers 3 prot tcp port 51606
xxxxexasn-priv:/export/ACSExalogicSystem/nodemgrs umounted
mount: trying 192.168.10.30 prog 100005 vers 3 prot tcp port 51606
xxxxexasn-priv:/export/products/Middleware11gPS3 umounted
[root@xxxxexa08 baseimage_patch]# mount -lt nfs
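When an unmount reports “device is busy”, it helps to see which mount points are left and what is holding them. Here is a minimal sketch, assuming the `mount -lt nfs` output format shown above; the function name `nfs_mountpoints` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads `mount -lt nfs` style output on stdin, i.e. lines
# of the form "<source> on <mountpoint> type nfs (<options>)", and prints one
# mount point per line.
nfs_mountpoints() {
    awk '$2 == "on" && $4 == "type" && $5 ~ /^nfs/ { print $3 }'
}
```

Assuming `fuser` (from the psmisc package) is installed on the node, something like `mount -lt nfs | nfs_mountpoints | while read -r mp; do fuser -vm "$mp"; done` would then list the processes still using each remaining mount, which in my case would have pointed straight at the nodemanager.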
OK, no NFS filesystems left mounted. We could start patching now, but to speed things up a bit more it’s (my personal) good practice to minimize mount/unmount times during the patch process by temporarily stripping out unneeded NFS entries from /etc/fstab. Make sure you have made a good backup of the original /etc/fstab file, as you need to restore it after the patch has completed. Also, as a precaution, don’t take out the /u01/common/general entry, as the patch files reside here (even though we made a local copy). I’ve had some problems when I did this before, when doing multiple nodes in parallel, so leave it in.
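The backup-and-strip step can be scripted so that it is repeatable across nodes. This is a rough sketch of how I would approach it, not something from the README: the function name `strip_nfs_entries` and the `.prepatch` backup suffix are my own choices.

```shell
#!/bin/bash
# Hypothetical helper: back up an fstab file and drop its NFS entries,
# except for the one mount point that must stay.
#   $1 = path to the fstab file
#   $2 = NFS mount point to keep (e.g. /u01/common/general)
# The original is saved as "<fstab>.prepatch" (suffix is an assumption).
strip_nfs_entries() {
    local fstab="$1" keep="$2" backup="$1.prepatch"
    cp -p "$fstab" "$backup" || return 1
    awk -v keep="$keep" '
        /^[[:space:]]*#/          { print; next }   # keep comments as they are
        $3 ~ /^nfs/ && $2 != keep { next }          # drop unneeded NFS entries
                                  { print }         # keep everything else
    ' "$backup" > "$fstab"
}
```

After the patch has completed, restoring is a matter of copying the backup back, for example `cp -p /etc/fstab.prepatch /etc/fstab`. Try the function on a copy of your fstab first and compare the result with `diff` before using it on the real file.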
4. Patch execution
Now that we have a minimal set of entries in our /etc/fstab file, we should have a pretty speedy patch procedure. Since the patch installation involves at least two reboots, it’s handy to follow the proceedings across reboots by logging onto the console using the Integrated Lights Out Management interface in another session (as is mentioned later on in the README).
JNs-MBP3-QA-2:~ jnwerk$ ssh email@example.com
Password:
Oracle(R) Integrated Lights Out Manager
Version 220.127.116.11.a r68533
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
-> start /SP/console
Are you sure you want to start /SP/console (y/n)? y
Serial console started. To stop, type ESC (
From this ILOM console session you can follow what goes on through reboots as well. We can now start the patch script.
Step 4. Start the patch script ebi_patch.sh from /opt/baseimage_patch/scripts:
[root@xxxxexa08 baseimage_patch]# cd scripts ; ./ebi_patch.sh
INFO: Wed Jul 4 13:45:27 CEST 2012: Compute Node Image Version Found: 18.104.22.168.0
INFO: Wed Jul 4 13:45:27 CEST 2012: Patch state file not found; creating file
INFO: Wed Jul 4 13:45:27 CEST 2012: Preparing to update kernel...
INFO: Wed Jul 4 13:45:27 CEST 2012: Backing up configuration files
INFO: Wed Jul 4 13:45:27 CEST 2012: Done backing up configuration files
INFO: Wed Jul 4 13:45:27 CEST 2012: Uninstalling infinibus
INFO: Wed Jul 4 13:45:27 CEST 2012: Done Uninstalling infinibus
INFO: Wed Jul 4 13:45:27 CEST 2012: Uninstalling OFED_IOV
warning: /etc/libsdp.conf saved as /etc/libsdp.conf.rpmsave
warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
INFO: Wed Jul 4 13:45:41 CEST 2012: Done uninstalling OFED_IOV
INFO: Wed Jul 4 13:45:41 CEST 2012: Uninstalling OFA
INFO: Wed Jul 4 13:45:41 CEST 2012: Done uninstalling OFA
INFO: Wed Jul 4 13:45:41 CEST 2012: Updating kernel
warning: ../OS/OracleLinux_5.6/Kernel/2.6.32-200.21.2.el5uek/kernel-uek-2.6.32-200.21.2.el5uek.x86_64.rpm: Header V3 DSA signature: NOKEY, key ID 1e5e0159
WARNING: No module ehci-hcd found for kernel 2.6.32-200.21.2.el5uek, continuing anyway
WARNING: No module ohci-hcd found for kernel 2.6.32-200.21.2.el5uek, continuing anyway
WARNING: No module uhci-hcd found for kernel 2.6.32-200.21.2.el5uek, continuing anyway
rmdir: /lib/modules/2.6.32-200.21.1.el5uek/updates/dkms: No such file or directory
INFO: Wed Jul 4 13:45:57 CEST 2012: Done updating kernel
INFO: Wed Jul 4 13:45:57 CEST 2012: Updating grub.conf
INFO: Wed Jul 4 13:45:57 CEST 2012: Done updating grub.conf
INFO: Wed Jul 4 13:45:57 CEST 2012: Kernel update done on compute node.
INFO: Wed Jul 4 13:45:57 CEST 2012: IMPORTANT: REBOOTING NOW. This script will AUTO-RUN ONCE after reboot.
Broadcast message from root (pts/0) (Wed Jul 4 13:45:57 2012):
The system is going down for reboot NOW!
[root@xxxxexa08 baseimage_patch]# Connection to xxxxexa08 closed by remote host.
Connection to xxxxexa08 closed.
The README says the following about this step: “Once the script completes execution, the node will reboot with the updated kernel, and will auto-reboot again to apply patches/upgrades to other components on the compute node. Logs will be available in ebi_20001.log and ebi_dcli.log files in the scripts directory.”
Note that in between reboots, there usually is no Infiniband connectivity if the OFED drivers are upgraded.
In part two of this post we will check if all went OK, finish the patching procedure for this node and complete the rolling upgrade procedure for our Exalogic Compute nodes.