| InfoDoc ID | Synopsis | Date |
| 45890 | Sun Fire[TM] 3800-6800: Firmware revision 5.13.x issues and workarounds | 19 Jul 2002 |
| Status | Issued |
| Description |
The new Sun Fire [TM] 3800 - 6800 firmware, revision 5.13.0, provides full SC failover functionality and fixes various outstanding bugs. As before, the firmware is distributed in a patch, in this case patch id:
Refer to the README in the patch for a full list of bug fixes.
Read the install.info and release_notes files in the patch.
The following information will continue to be valid for all 5.13.x variants unless specifically noted.
Below are a few issues and "gotchas" that you may encounter.
Essential - Always upgrade SSC1 first!
Failure to follow this instruction will result in problems such as crashed domains, lost configuration information, and inaccessible domains.
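The ordering rule above can be written down as a small pre-flight check. This is only an illustrative sketch: the function name and the list-of-names input are assumptions made for the example and are not part of the firmware or its tooling.

```python
def upgrade_order(controllers):
    """Return the SC names in the order they should be flash-updated.

    Encodes the single rule stated in this document: SSC1 is always
    upgraded before SSC0. Names are compared case-insensitively.
    """
    wanted = ["SSC1", "SSC0"]  # SSC1 first, always
    present = {name.upper() for name in controllers}
    unknown = present - set(wanted)
    if unknown:
        raise ValueError(f"unexpected controller name(s): {sorted(unknown)}")
    return [name for name in wanted if name in present]


print(upgrade_order(["ssc0", "ssc1"]))  # ['SSC1', 'SSC0']
```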
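The remainder of this document discusses the following issues:
1. What to do if SSC0 is upgraded first
2. Hot-Plugging SCs with old revs of firmware into a 5.13.0 platform
3. Hot-Plugging SCs with 5.13.0 into a platform with older revs of firmware
4. Replacing System Boards (SBs) and I/O Boards (IBs) with different revs of firmware
5. SC Clock Failover Issues
6. SC Communication Issues After SC Failover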
Problems that do not fit any of the issues listed above should be directed to the GSCC (http://gscc) in your GEO for analysis and escalation to CPRE as appropriate.
1. What to do if SSC0 is upgraded first
If SSC0 is upgraded first, you will end up with two spare system controllers. DO NOT try to recover by pressing reset buttons or re-flashing; doing so will almost certainly crash any running domains on your platform.
There is a recovery procedure for this situation. Engage your local GSCC (http://gscc) and ask for it. The procedure is not published because it uses undocumented commands.
The following example shows what happens if SSC0 is updated first.
# telnet 4800-sc0
System Controller '4800-sc0':
Type 0 for Platform Shell
Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console
Input: 0
Platform Shell
4800-sc0:SC> flashupdate -f ftp://172.29.3.44/pub/112494-01 all
As part of this update, the system controller will automatically reboot.
RTOS will be upgraded automatically during the next boot.
ScApp will be upgraded automatically during the next boot.
After this update you must reboot each active domain that was upgraded.
Do you want to continue? [no] yes
Retrieving: ftp://172.29.3.44/pub/112494-01/sgcpu.flash
Validating ............. Done
Current firmware version: 5.12.6
New firmware version: 5.13.0
Programming /N0/SB2 PROM 0
Erasing ............. Done
Programming ............. Done
Verifying ............. Done
.
.
.
Flashupdate
Connecting to 172.29.3.44...
Transferring sgrtos.flash via FTP : 679648
Comparing image and flash...
Image and flash are different. Proceeding with update.
Erasing flashprom sectors at address 0x20000000: 11/11 = 100%
Programming: 11/11 = 100%
Connecting to 172.29.3.44...
Transferring sgsc.flash via FTP : 5548663
Comparing image and flash...
Image and flash are different. Proceeding with update.
Erasing flashprom sectors at address 0x36000000: 85/85 = 100%
Programming: 85/85 = 100%
.
.
.
Copyright 2001-2002 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Sun Fire 3800-6800 System Firmware
RTOS version: 23
ScApp version: 5.13.0
SC POST diag level: off
The date is Thursday, May 23, 2002, 11:29:30 AM GMT+01:00.
May 23 11:29:31 4800-sc0 Platform.SC: Boot: ScApp 5.13.0, RTOS 23
May 23 11:29:36 4800-sc0 Platform.SC: Clock Source: 75MHz
May 23 11:29:38 4800-sc0 Platform.SC: SC Failover Monitor: enabled
May 23 11:30:08 4800-sc0 Platform.SC: Spare System Controller
May 23 11:30:08 4800-sc0 Platform.SC: SC Failover: enabled but not active.
System Controller '4800-sc0':
Type 0 for Platform Shell
Input: 0
Platform Shell - Spare System Controller
4800-sc0:sc>
--------------------------------------------------------------------------
# telnet 4800-sc1
System Controller '4800-sc1':
Type 0 for Platform Shell
Input: 0
Platform Shell - Slave System Controller
4800-sc1:SC>
2. Hot-Plugging SCs with old revs of firmware into a 5.13.0 platform
5.13 firmware does not mix with 5.11 or 5.12 firmware. If SC1 is to be replaced in a platform running 5.13.x, and the replacement has 5.11 or 5.12 firmware loaded, recovery is simple and is outlined below.
If SC0 is to be replaced in a platform running 5.13.x, and the replacement has 5.11 or 5.12 firmware loaded, the replacement will not boot, as outlined below. Recovery is to remove it and install an SC that is at 5.13.x.
Warning: Before removing SC0, be sure to issue the following command from SC1 or you may crash any running domains:
poweroff ssc0
If SC0 is to be replaced in a 5.13 platform, ensure the replacement has 5.13.0 firmware loaded on it. Double check with the control room that this is the case.
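The rules above can be summarised as a small lookup. The sketch below is illustrative only: the function name and the returned wording are assumptions made for this example, and it covers only the firmware revisions this document mentions.

```python
def hotplug_outcome(slot, replacement_version):
    """Describe what to expect when an SC is hot-plugged into a 5.13.x platform.

    slot is "SSC0" or "SSC1"; replacement_version is the ScApp version
    on the replacement SC, for example "5.11.9".
    """
    slot = slot.upper()
    if slot not in ("SSC0", "SSC1"):
        raise ValueError(f"unknown slot: {slot}")
    # Only the first two components (e.g. "5.12") decide compatibility here.
    release = ".".join(replacement_version.split(".")[:2])
    if release == "5.13":
        return "compatible: matches the platform firmware"
    if release in ("5.11", "5.12"):
        if slot == "SSC1":
            # The SC boots and SSC0 logs a request to upgrade it.
            return "boots on old firmware: upgrade the replacement to 5.13.x"
        # An old-firmware SC in slot SSC0 cannot be booted on the platform.
        return "will not boot: remove it and fit an SC already at 5.13.x"
    return "not covered by this document"


print(hotplug_outcome("SSC1", "5.11.9"))
print(hotplug_outcome("SSC0", "5.12.6"))
```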
Example - Hot-Plugging SC with old rev of firmware in slot SSC1
Output from SSC0:
sc0-4800a:SC> poweroff ssc1
SSC1: powered off
sc0-4800a:SC>
May 31 10:34:45 sc0-4800a Platform.SC: Clock failover disabled.
May 31 10:37:07 sc0-4800a Platform.SC: SSC1 removed
May 31 10:37:37 sc0-4800a Platform.SC: SSC1 inserted
sc0-4800a:SC>
sc0-4800a:SC>
May 31 10:39:57 sc0-4800a Platform.SC: SC Failover: the other SC is
running an old version of firmware which is not compatible with failover.
You need to upgrade this firmware as soon as possible.
sc0-4800a:SC>
sc0-4800a:SC>
Output from SSC1:
Hardware Reset...
@(#) SYSTEM CONTROLLER(SC) POST 18 2001/06/14 11:20
PSR = 0x044010e5
PCR = 0x04004000
SelfTest running at DiagLevel:0x20
SC Boot PROM Test
BootPROM CheckSum Test
.
.
.
Console Bus Hub Test
CBH Register Access Test
POST Complete.
ERI Device Present
Getting MAC address for SSC1
MAC address is 8:0:20:d8:ab:64
Using DHCP to configure network interface
Attached TCP/IP interface to eri unit 0
Attaching interface lo0...done
interrupt: 100 Mbps full duplex link up
Initiating DHCP negotiations for eri0
dhcpcBind() failed: errno = 0xd0003
Adding 2851 symbols for standalone.
Copyright 2001 Sun Microsystems, Inc. All rights reserved.
RTOS version: 18
ScApp version: 5.11.9
SC POST diag level: min
The date is Friday, May 31, 2002, 3:39:42 AM PDT.
SbbcAsic.showResetReason: SBBC reset status=0160 POR
PowerOn or Invalid magic: Initializing the SC SRAM
May 31 03:39:46 noname Chassis-Port.SC: Backing up Static ID Info to NVCI
May 31 03:39:46 noname Chassis-Port.SC: Clock source: 75MHz
May 31 03:39:48 noname Chassis-Port.SC: Starting Slave Thread
System Controller 'noname.example.com':
Type 0 for Platform Shell
Input: 0
Platform Shell
noname:SC> showsc
SC: SSC1
SC date: Fri May 31 03:39:56 PDT 2002
SC uptime: 25 seconds
ScApp version: 5.11.9
RTOS version: 18
noname:SC>
Example - Hot-Plugging SC with old rev of firmware in slot SSC0
Output from SSC1:
sc1-4800a:SC> poweroff ssc0
SSC0: powered off
sc1-4800a:SC>
May 31 10:48:28 sc1-4800a Platform.SC: SSC0 removed
May 31 10:49:02 sc1-4800a Platform.SC: SSC0 inserted
sc1-4800a:SC>
sc1-4800a:SC>
May 31 10:50:25 sc1-4800a Platform.SC: SC Failover: the other SC is
running an old version of firmware. It cannot be booted on this platform.
Contact your support organization.
sc1-4800a:SC>
sc1-4800a:SC>
sc1-4800a:SC>
Output from SSC0:
Hardware Reset...
@(#) SYSTEM CONTROLLER(SC) POST 18 2001/06/14 11:20
PSR = 0x044010e5
PCR = 0x04004000
SelfTest running at DiagLevel:0x20
SC Boot PROM Test
BootPROM CheckSum Test
.
.
.
Console Bus Hub Test
CBH Register Access Test
POST Complete.
ERI Device Present
Getting MAC address for SSC0
MAC address is 8:0:20:d8:ab:63
Using DHCP to configure network interface
Attached TCP/IP interface to eri unit 0
Attaching interface lo0...done
Timeout waiting for network driver (flags=0x8062)
Adding 2851 symbols for standalone.
SSC0 is unusable at this point. Recovery is to remove it and install an SC that is at 5.13.0.
3. Hot-Plugging SCs with 5.13.0 into a platform with older revs of firmware
Plugging an SC with 5.13.0 firmware into a 5.12.6 platform, slot SSC0
Remember, the platform will have had to be powered off to effect this FRU replacement. The state the system controllers end up in depends on which one boots first, which is largely down to SCPOST levels and the SC network settings. For example, an SC from logistics should be at default settings, which means SCPOST level min and the network configured for DHCP.
If SSC1 boots first, it will put out a heartbeat (since it is at 5.12.6), and this will cause SSC0 to assume the role of spare.
System Controller 'noname.example.com':
Type 0 for Platform Shell.
Input: 0
Platform Shell - Spare System Controller
noname:sc>
This is not a problem.
If SSC0 boots first, the SC may become confused. Ignore this.
Flashupdate SSC0 with 5.12.6 firmware, and power-cycle the platform.
Plugging an SC with 5.13.0 firmware into a 5.12.6 platform, slot SSC1
Again, the platform will have had to be powered off to effect this FRU replacement.
If SSC0 boots first, it will be the main and SSC1 the spare. Flashupdate SSC1 with 5.12.6 firmware, and power-cycle the platform.
If SSC1 boots first, you will get a message on SSC1:
Platform.SC: SC Failover: the other SC is running an old version of firmware. It cannot be booted on this platform. Contact your support organization.
SSC0 will hang at the point where the RTOS finishes loading. Ignore SSC0, flashupdate SSC1 with 5.12.6 firmware, and power-cycle the platform. You will then be back at SSC0 as main and SSC1 as spare.
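The four cases in this section can be collected into one lookup table keyed by the slot holding the 5.13.0 SC and by which SC boots first. This is an illustrative sketch only; the names are assumptions made for the example. For slot SSC0 the document states the flashupdate step once, after both cases, and the table assumes it applies to both.

```python
# (slot holding the 5.13.0 SC, SC that boots first) -> (what you see, what to do)
CASES = {
    ("SSC0", "SSC1"): (
        "SSC0 assumes the role of spare; this is not a problem",
        "flashupdate SSC0 with 5.12.6, then power-cycle the platform",
    ),
    ("SSC0", "SSC0"): (
        "the SC may become confused; ignore this",
        "flashupdate SSC0 with 5.12.6, then power-cycle the platform",
    ),
    ("SSC1", "SSC0"): (
        "SSC0 is main and SSC1 is spare",
        "flashupdate SSC1 with 5.12.6, then power-cycle the platform",
    ),
    ("SSC1", "SSC1"): (
        "SSC1 reports old firmware on the other SC; SSC0 hangs after RTOS load",
        "ignore SSC0, flashupdate SSC1 with 5.12.6, then power-cycle the platform",
    ),
}


def downgrade_plan(slot, first_to_boot):
    """Return (symptom, action) for a 5.13.0 SC fitted to a 5.12.6 platform."""
    key = (slot.upper(), first_to_boot.upper())
    if key not in CASES:
        raise ValueError(f"unknown combination: {key}")
    return CASES[key]


symptom, action = downgrade_plan("SSC1", "SSC1")
print(symptom)
print(action)
```

In every case the action brings the replacement SC down to the platform's 5.12.6 level; only the symptoms differ.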
4. Replacing System Boards (SBs) and I/O Boards (IBs) with different revs of firmware
If you are going to replace a system board or I/O assembly, be aware that the replacement board firmware must be compatible with the system controller firmware. To check the firmware compatibility for each board, use the showboards command with the "-p version" or "-v" option.
If the firmware of the replacement board is not compatible with the firmware for the system controller, you must upgrade or downgrade the firmware on the replacement board accordingly, using flashupdate -c. It is recommended that replacement boards run the same revision of firmware as the other boards in the system.
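The recommendation that replacement boards run the same revision as the other boards can be checked with a few lines once the revisions have been noted down from the showboards output. This is a sketch under assumptions: the function name is made up for the example, the board names other than /N0/SB2 and all revision values are hypothetical, and the revisions are typed in by hand rather than parsed from the SC. It only flags revision differences; compatibility itself is reported by showboards.

```python
def boards_out_of_step(board_versions, reference_board):
    """Return the boards whose firmware revision differs from reference_board's."""
    target = board_versions[reference_board]
    return sorted(b for b, v in board_versions.items() if v != target)


# Hypothetical revisions, noted down by hand for this example.
versions = {
    "/N0/SB0": "5.13.0",
    "/N0/SB2": "5.12.6",  # the replacement board
    "/N0/IB6": "5.13.0",
}
print(boards_out_of_step(versions, "/N0/SB0"))  # ['/N0/SB2']
```

Any board returned here is a candidate for flashupdate -c.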
5. SC Clock Failover Issues
The SC clock failover mechanism is different from the SC failover mechanism. The SC clock failover function does not happen at the same time as the SC failover function. When the system is up and running with no problems, all the boards are using a clock signal from the main system controller. However, once SC failover occurs, the main SC and the spare SC swap their roles. Subsequently, the boards within the system continue to use the same clock they were using prior to the failover.
Workaround:
Power off the system controller. The "poweroff sscX" command automatically attempts to switch all the boards over to the clock supplied by "this" SC (i.e. the SC that is not being powered off). Note that "poweroff sscX" powers off the "other" system controller, not the one on which the command is typed.
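The difference between the two mechanisms can be seen in a toy model. The class below is purely illustrative: its names are assumptions made for this example, it does not talk to any hardware, and it treats the clock switch attempted by poweroff as always succeeding.

```python
class Platform:
    """Toy model of SC roles and the clock source used by the boards."""

    def __init__(self):
        self.main = "SSC0"
        self.spare = "SSC1"
        self.clock_source = "SSC0"  # boards start on the main SC's clock

    def sc_failover(self):
        # Roles swap, but the boards keep the clock they were already using.
        self.main, self.spare = self.spare, self.main

    def poweroff(self, typed_on, target):
        # The command powers off the *other* SC and moves the boards to the
        # clock of the SC on which it was typed (modelled as always working).
        if typed_on == target:
            raise ValueError("poweroff sscX must name the other SC")
        self.clock_source = typed_on


p = Platform()
p.sc_failover()
print(p.main, p.clock_source)  # SSC1 SSC0
p.poweroff(typed_on="SSC1", target="SSC0")
print(p.main, p.clock_source)  # SSC1 SSC1
```

After the failover the new main is SSC1 while the boards are still clocked from SSC0; only the poweroff step brings the clock source in line with the main SC.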
6. SC Communication Issues After SC Failover
When the system is running normally and failover is enabled, the spare SC and the main SC communicate status and configuration changes with each other. If a failover occurs and the main SC transfers its responsibilities to the spare SC, failover between the two SCs becomes disabled. With failover disabled, no data is shared between the two SCs, and the most up-to-date configuration and status information is not passed between the two SCs. Failover must be manually re-enabled.
If the chassis of the system is then power-cycled, the roles of the main SC and the spare SC may not necessarily be the same as they were prior to the power cycle. It is possible for the system to boot using the previously spare SC (with a possibly outdated state configuration) as the new main SC.
Workaround:
If failover becomes disabled, manually re-enable failover as soon as possible so the configurations can be re-synchronized.
If this is not possible, do a dumpconfig as outlined in the Sun Fire 3800 - 6800 Platform Administration Guide. Then, if the power is cycled and SSC0 assumes the role of main, you can restore the setup to SC0 using restoreconfig. Note that you will have to copy <sc1_hostname>.tod and <sc1_hostname>.nvci to <sc0_hostname>.tod and <sc0_hostname>.nvci for this workaround.
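The file copy in the last step could be scripted on the server that holds the dumpconfig output. The sketch below is a minimal example under assumptions: the function name and the dump directory are hypothetical, and it assumes both files sit in a single directory that you can write to.

```python
import shutil
from pathlib import Path


def clone_config_files(dump_dir, sc1_hostname, sc0_hostname):
    """Copy <sc1_hostname>.tod/.nvci to <sc0_hostname>.tod/.nvci in dump_dir.

    Returns the list of files created. The SC1 originals are left in place.
    """
    dump_dir = Path(dump_dir)
    copied = []
    for ext in (".tod", ".nvci"):
        # Build names by concatenation: hostnames may themselves contain dots.
        src = dump_dir / f"{sc1_hostname}{ext}"
        dst = dump_dir / f"{sc0_hostname}{ext}"
        if not src.is_file():
            raise FileNotFoundError(src)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(dst)
    return copied
```

For example, clone_config_files("/path/to/dump", "4800-sc1", "4800-sc0") would create 4800-sc0.tod and 4800-sc0.nvci next to the SC1 files, ready for restoreconfig on SC0. The directory shown is a placeholder.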
Keywords: firmware, 5.13.x, 3800, 6800, sunfire