| SRDB ID | Synopsis | Date |
| 48128 | Sun Fire[TM] 15K: Recovering from a System Controller disk failure | 4 Nov 2002 |
| Status | Issued |
| Description |
How to restore full Sun Fire 15K System Controller functionality following failure of one or both SC disks.
SOLUTION SUMMARY:
Assumptions:
OS version: Solaris[TM] 8 10/01
SC software version: SMS 1.1
SDS version: SDS 4.2.1
Prerequisites:
Output of the 'metadb -i' command (included in explorer).
Output of the 'metastat -p' command (included in explorer).
Output of the 'prtvtoc' command for all disks (included in explorer).
Recent backup tape for the filesystems (needed if both disks are lost).
Scenario #1: Loss of 1 of 2 disks on SC
1. Because the SC must be shut down to replace the defective disk, first make sure
that this SC is the SPARE. As user sms-svc, use the 'showfailover' command to
determine the status of the SCs:-
% showfailover -v
SC Failover Status: ACTIVE
Clock Phase Locked: .....................................Yes
HASRAM Status (by location):
HASRAM (CSB at CS1): ....................................Good
HASRAM (CSB at CS0): ....................................Good
Status of sf15k-sc1:
Role: ....................................MAIN
...
Status of sf15k-sc0:
Role: ...................................SPARE
...
2. Use the 'metastat' command to determine the failed submirrors:-
# metastat
d10: Mirror
Submirror 0: d11
State: Okay
Submirror 1: d12
State: Needs maintenance
...
d20: Mirror
Submirror 0: d21
State: Okay
Submirror 1: d22
State: Needs maintenance
...
d30: Mirror
Submirror 0: d31
State: Okay
Submirror 1: d32
State: Needs maintenance
...
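On an SC with many metadevices, the failed submirrors can be picked out of the 'metastat' output with a small filter. The following is a minimal sketch, not part of the original procedure: the function name failed_submirrors is hypothetical, and it assumes the output has the "Mirror" / "Submirror N:" / "State:" layout shown above.

```shell
# Hypothetical helper: read 'metastat' output on stdin and print
# "<mirror> <submirror>" for every submirror in "Needs maintenance".
# Assumes the line layout shown in this article.
failed_submirrors() {
    awk '
        $2 == "Mirror"    { mirror = $1; sub(/:$/, "", mirror) }
        $1 == "Submirror" { sub_name = $3 }
        $1 == "State:" && /Needs maintenance/ && sub_name != "" {
            print mirror, sub_name
        }
        # A State line belongs only to the Submirror line just before it.
        $1 == "State:"    { sub_name = "" }
    '
}

# Example use on the SC (as superuser):
#   metastat | failed_submirrors
```

For the output shown above this would list d12, d22 and d32, each next to its parent mirror.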
3. Use the 'metadb' command to identify unavailable or unreadable state database
replicas:-
# metadb
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t2d0s4
a p luo 1050 1034 /dev/dsk/c0t2d0s4
a p luo 2084 1034 /dev/dsk/c0t2d0s4
a p luo 16 1034 /dev/dsk/c0t2d0s5
a p luo 1050 1034 /dev/dsk/c0t2d0s5
a p luo 2084 1034 /dev/dsk/c0t2d0s5
M p Unknown Unknown /dev/dsk/c0t3d0s4
M p Unknown Unknown /dev/dsk/c0t3d0s4
M p Unknown Unknown /dev/dsk/c0t3d0s4
M p Unknown Unknown /dev/dsk/c0t3d0s5
M p Unknown Unknown /dev/dsk/c0t3d0s5
M p Unknown Unknown /dev/dsk/c0t3d0s5
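In the listing above, the replicas on c0t3d0 carry an uppercase flag (M), which 'metadb' uses to mark a replica with a problem, while healthy replicas show only lowercase flags. The sketch below counts the bad replicas per slice. The function name bad_replica_slices is hypothetical, and the parsing assumes that the last three fields of each line are first block, block count and device path, as in the output above.

```shell
# Hypothetical helper: read 'metadb' output on stdin and print each slice
# holding replicas with an error flag, followed by the number of such replicas.
# Assumption: flags are every field before the last three fields, and an
# uppercase letter among the flags marks a problem replica.
bad_replica_slices() {
    LC_ALL=C awk '
        $NF ~ /^\/dev\/dsk\// {
            flags = ""
            for (i = 1; i <= NF - 3; i++) flags = flags $i
            if (flags ~ /[A-Z]/) bad[$NF]++
        }
        END { for (dev in bad) print dev, bad[dev] }
    ' | sort
}

# Example use on the SC (as superuser):
#   metadb | bad_replica_slices
```

The slices it reports are the ones to pass to 'metadb -d -f' in the next step.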
4. Use the 'metadb' command to delete the state database replicas on the bad disk:-
# metadb -d -f c0t3d0s4
# metadb -d -f c0t3d0s5
Depending on how the disk has failed, this step may not succeed. If it fails,
delete the state database replicas after the reboot instead (step 9).
5. Because the SC must be shut down to replace the defective disk, make sure that it
will later be booted using the correct OBP alias. As superuser on the SC, set
"auto-boot?" to false with the 'eeprom' command so that the SC stops at the ok prompt
instead of rebooting automatically:-
# eeprom 'auto-boot?=false'
6. If the disk failure occurred on the MAIN SC: loss of a disk is not a failover
condition, so force a failover to the other SC using the 'setfailover' command
as user sms-svc:-
% setfailover force
This action forces the former MAIN SC to reset and reboot as SPARE, and transfers the
role of MAIN SC to the other SC.
If the disk failure occurred on the SPARE SC: disable failover on the MAIN SC using the
'setfailover' command as user sms-svc. The SPARE SC can then be shut down:-
On the MAIN SC
% setfailover off
On the SPARE SC
# init 0
7. Replace the defective disk in the SCPER board (see the Sun Fire 15K System Service
Manual 806-3512-xx).
8. Boot using the correct OBP alias:-
ok devalias
disk2 /pci@1f,0/pci@1,1/scsi@2/sd@2,0:a
disk3 /pci@1f,0/pci@1,1/scsi@2/sd@3,0:a
If the faulty disk was c0t2d0 (disk2), boot from disk3.
If the faulty disk was c0t3d0 (disk3), boot from disk2.
ok boot disk2
or
ok boot disk3
9. Perform this step if step 4 above failed (the 'metadb -d -f' command was
unsuccessful because of the nature of the disk failure)
OR
if a reboot occurs before the disk has been replaced. In either case the boot fails
and stops in single-user mode, because more than half of the state database replicas
must be readable.
In single-user mode, use the 'metadb' command to delete the state database replicas
on the failed disk (ignore any "Read-only file system" error messages),
# metadb -d -f c0t3d0s4
# metadb -d -f c0t3d0s5
then proceed with normal startup.
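The majority rule behind this step can be checked with simple arithmetic. In the configuration shown in this article there are 12 replicas, 6 on each disk, so losing one disk leaves 6 of 12 readable, which is exactly half and therefore not a majority. The sketch below only illustrates that arithmetic; the function name has_replica_majority is hypothetical and is not an SDS command.

```shell
# Hypothetical helper: succeed (exit status 0) when more than half of the
# configured replicas are readable.
#   $1 = number of readable replicas, $2 = total number of replicas
has_replica_majority() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

# One disk lost, stale replicas still configured: 6 of 12 is not a majority.
has_replica_majority 6 12 || echo "no majority: boot stops in single-user mode"

# After 'metadb -d -f' has removed the 6 dead replicas: 6 of 6 is a majority.
has_replica_majority 6 6 && echo "majority: normal startup can proceed"
```

This is why deleting the replicas on the failed disk is enough to let the SC boot normally from the surviving disk.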
10. Using the 'format' command, partition the new disk exactly as the failed disk was
partitioned (see the saved 'prtvtoc' output).
11. Re-create the state database replicas with the 'metadb' command, using the
previous configuration:-
# metadb -a -c3 -f c0t3d0s4
# metadb -a -c3 -f c0t3d0s5
This configuration can be checked using 'metadb -i'.
12. Use the 'metareplace' command to re-enable the submirrors:-
# metareplace -e d10 c0t3d0s0
# metareplace -e d20 c0t3d0s1
# metareplace -e d30 c0t3d0s7
The resynchronization takes about 20 minutes per gigabyte of filesystem.
This configuration can be checked using 'metastat'.
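As a rough worked example of the 20-minutes-per-gigabyte figure, the sketch below adds up the slice sizes and multiplies by 20. The function name estimate_resync_minutes is hypothetical, the sizes used in the example (8.00 GB, 2.00 GB and 6.84 GB) are taken from the partition table printed at the end of this article, and the total assumes the three mirrors resynchronize one after another, which is an assumption rather than something the article states.

```shell
# Hypothetical helper: estimate total resync time in minutes.
# Arguments: slice sizes in GB. The 20 min/GB rate is the article's rough figure.
estimate_resync_minutes() {
    printf '%s\n' "$@" |
        awk '{ total += $1 } END { printf "%.0f\n", total * 20 }'
}

# Slices s0, s1 and s7 from this article's partition table:
estimate_resync_minutes 8.00 2.00 6.84
```

For these three slices the estimate comes to roughly five and a half hours, so 'metastat' is worth checking periodically rather than waiting at the console.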
13. Set auto-boot? to true using the 'eeprom' command as superuser:-
# eeprom 'auto-boot?=true'
14. Enable failover using the 'setfailover' command as user sms-svc on
the MAIN SC:-
% setfailover on
15. Synchronize data from MAIN SC to SPARE SC using the 'setdatasync' command as
user sms-svc on the MAIN SC:-
% setdatasync backup
Scenario #2: Loss of both disks on SC
If the SDS-mirrored root disks of an SC have been completely destroyed, use the
following steps to rebuild the mirrored boot configuration. If this problem occurs on
the MAIN SC, first fail over to the other SC using the 'setfailover' command as
user sms-svc:-
% setfailover force
We are now working on a SPARE SC which has the defective disks.
1. If disk failures have occurred, replace the defective disks in the SCPER board
(see the Sun Fire 15K System Service Manual 806-3512-xx). If the disks have not failed
but have been corrupted, continue with step 2 below.
2. Boot from the Solaris 8 10/01 CD, partition BOTH disks as they were partitioned
before (use 'format', with the saved 'prtvtoc' output or the other SC as a reference),
run 'newfs' on each of the partitions and mount the root filesystem on /a:
ok boot cdrom -s
# format
# newfs /dev/rdsk/c0t2d0s0
# newfs /dev/rdsk/c0t2d0s7
# mount /dev/dsk/c0t2d0s0 /a
3. Restore the root filesystem from the backup tape into /a and install the boot
block using:-
# installboot /usr/platform/sun4u/lib/fs/ufs/bootblk /dev/rdsk/c0t2d0s0
4. Restore the /export/install filesystem.
5. Modify the restored /etc/system file: remove all lines between the "MDD root info"
lines and between the "MDD database info" lines:-
* Begin MDD root info (do not edit)
forceload: misc/md_trans
forceload: misc/md_raid
forceload: misc/md_hotspares
forceload: misc/md_sp
forceload: misc/md_stripe
forceload: misc/md_mirror
forceload: drv/pcipsy
forceload: drv/simba
forceload: drv/glm
forceload: drv/sd
rootdev:/pseudo/md@0:0,10,blk
* End MDD root info (do not edit)
* Begin MDD database info (do not edit)
set md:mddb_bootlist1="sd:20:16 sd:20:1050 sd:20:2084 sd:21:16 sd:21:1050"
set md:mddb_bootlist2="sd:21:2084 sd:28:16 sd:28:1050 sd:28:2084 sd:29:16"
set md:mddb_bootlist3="sd:29:1050 sd:29:2084"
* End MDD database info (do not edit)
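Removing these lines by hand is error-prone, so the edit can also be done with a filter. The following is a minimal sketch, assuming the Begin/End marker lines look exactly like those above. The function name strip_mdd_entries is hypothetical, and the /a/etc/system path in the usage comment assumes the restored root is still mounted on /a.

```shell
# Hypothetical helper: copy stdin to stdout, dropping every line that sits
# between the "Begin MDD ... info" and "End MDD ... info" markers.
# The marker lines themselves are kept.
strip_mdd_entries() {
    awk '
        /^\* Begin MDD (root|database) info/ { print; inside = 1; next }
        /^\* End MDD (root|database) info/   { inside = 0 }
        !inside                              { print }
    '
}

# Example use (assumed paths, keep a copy of the original file first):
#   cp /a/etc/system /a/etc/system.orig
#   strip_mdd_entries < /a/etc/system.orig > /a/etc/system
```

Comparing the result against the saved copy with 'diff' shows exactly which forceload, rootdev and mddb_bootlist lines were removed.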
6. Modify the restored /etc/vfstab file by changing the metadevices for the root
filesystem and swap back to regular slices. Comment out all other metadevices:-
Before
#device device mount FS fsck mount mount
#to mount to fsck point type pass at boot options
fd - /dev/fd fd - no -
/proc - /proc proc - no -
/dev/md/dsk/d20 - - swap - no -
/dev/md/dsk/d10 /dev/md/rdsk/d10 / ufs 1 no logging
/dev/md/dsk/d30 /dev/md/rdsk/d30 /export/install ufs 2 yes logging
swap - /tmp tmpfs - yes -
After
#device device mount FS fsck mount mount
#to mount to fsck point type pass at boot options
fd - /dev/fd fd - no -
/proc - /proc proc - no -
/dev/dsk/c0t2d0s1 - - swap - no -
/dev/dsk/c0t2d0s0 /dev/rdsk/c0t2d0s0 / ufs 1 no logging
#/dev/md/dsk/d30 /dev/md/rdsk/d30 /export/install ufs 2 yes logging
swap - /tmp tmpfs - yes -
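The Before/After change above can be expressed as a filter. This is only a sketch under the assumptions of this article: d10 maps to c0t2d0s0, d20 maps to c0t2d0s1, and every other metadevice entry is commented out. The function name vfstab_to_slices is hypothetical. Note that rewritten lines come out with single spaces between fields, which vfstab accepts.

```shell
# Hypothetical helper: rewrite vfstab (stdin to stdout) so that root and swap
# use plain slices on c0t2d0 and all other metadevice entries are commented out.
# The d10 -> s0 and d20 -> s1 mapping is the one used in this article.
vfstab_to_slices() {
    awk '
        $1 == "/dev/md/dsk/d10" {
            $1 = "/dev/dsk/c0t2d0s0"; $2 = "/dev/rdsk/c0t2d0s0"; print; next
        }
        $1 == "/dev/md/dsk/d20" { $1 = "/dev/dsk/c0t2d0s1"; print; next }
        $1 ~ /^\/dev\/md\//     { print "#" $0; next }
                                { print }
    '
}

# Example use (assumed path, restored root mounted on /a):
#   cp /a/etc/vfstab /a/etc/vfstab.orig
#   vfstab_to_slices < /a/etc/vfstab.orig > /a/etc/vfstab
```

Keeping vfstab.orig also makes step 13 simpler, since the original metadevice entries can be copied back from it.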
7. Remove all lines (except comment lines) from the /etc/lvm/mddb.cf file.
8. Boot the system from the freshly restored boot disk:-
ok boot disk2
At this time, this SC is defined as the SPARE SC.
For reference, disk aliases are:-
ok devalias
disk2 /pci@1f,0/pci@1,1/scsi@2/sd@2,0:a
disk3 /pci@1f,0/pci@1,1/scsi@2/sd@3,0:a
9. Re-create the state database replicas with the 'metadb' command, using the previous configuration:-
# metadb -a -c3 -f c0t2d0s4
# metadb -a -c3 -f c0t2d0s5
# metadb -a -c3 -f c0t3d0s4
# metadb -a -c3 -f c0t3d0s5
10. Modify /etc/lvm/md.tab so that all mirrors are one-way mirrors, and make sure
that each one-way mirror refers to the restored side:-
Before
d10 -m d11 d12
d11 1 1 /dev/dsk/c0t2d0s0
d12 1 1 /dev/dsk/c0t3d0s0
d20 -m d21 d22
d21 1 1 /dev/dsk/c0t2d0s1
d22 1 1 /dev/dsk/c0t3d0s1
d30 -m d31 d32
d31 1 1 /dev/dsk/c0t2d0s7
d32 1 1 /dev/dsk/c0t3d0s7
After
d10 -m d11
d11 1 1 /dev/dsk/c0t2d0s0
d12 1 1 /dev/dsk/c0t3d0s0
d20 -m d21
d21 1 1 /dev/dsk/c0t2d0s1
d22 1 1 /dev/dsk/c0t3d0s1
d30 -m d31
d31 1 1 /dev/dsk/c0t2d0s7
d32 1 1 /dev/dsk/c0t3d0s7
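The Before/After edit above amounts to keeping only the first submirror on every "-m" line. A minimal sketch of that transformation follows. The function name one_way_mirrors is hypothetical, and it assumes that the first submirror listed (d11, d21, d31 on c0t2d0 in this article) is the restored side.

```shell
# Hypothetical helper: rewrite md.tab (stdin to stdout) so that every mirror
# definition keeps only its first submirror. All other lines pass through.
# Assumption: the first submirror on each "-m" line is the restored side.
one_way_mirrors() {
    awk '
        $2 == "-m" { print $1, $2, $3; next }
                   { print }
    '
}

# Example use (keep a copy of the original file first):
#   cp /etc/lvm/md.tab /etc/lvm/md.tab.orig
#   one_way_mirrors < /etc/lvm/md.tab.orig > /etc/lvm/md.tab
```

The definitions of d12, d22 and d32 are left in place on purpose, so that they exist as metadevices when they are attached later with 'metattach'.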
11. Create the metadevices:-
# metainit -f -a
12. Set the metadevice as a root device:-
# metaroot d10
13. Restore the metadevice entries in the /etc/vfstab file:-
#device device mount FS fsck mount mount
#to mount to fsck point type pass at boot options
fd - /dev/fd fd - no -
/proc - /proc proc - no -
/dev/md/dsk/d20 - - swap - no -
/dev/md/dsk/d10 /dev/md/rdsk/d10 / ufs 1 no logging
/dev/md/dsk/d30 /dev/md/rdsk/d30 /export/install ufs 2 yes logging
swap - /tmp tmpfs - yes -
14. Reboot
15. The second submirrors can now be attached to the mirrors using the
'metattach' command:-
# metattach d10 d12
# metattach d20 d22
# metattach d30 d32
16. Failover must be enabled using the 'setfailover' command as user sms-svc on
the MAIN SC:-
% setfailover on
17. Synchronize data from MAIN SC to SPARE SC using the 'setdatasync' command as
user sms-svc on the MAIN SC:-
% setdatasync backup
Note: Scripts are available on the EIS-CD to set up the SC disks:-
/sun/tools/SF15K/SF15k-sc-bootdisks-start.sh
/sun/tools/SF15K/SF15k-sc-bootdisks-finish.sh
After running the scripts:-
# df -k
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d10 8261393 1948634 6230146 24% /
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
swap 2185528 8 2185520 1% /var/run
swap 2187656 2136 2185520 1% /tmp
/dev/md/dsk/d30 7061557 1370547 5620395 20% /export/install
# metastat -p
d10 -m d11 d12 1
d11 1 1 c0t2d0s0
d12 1 1 c0t3d0s0
d20 -m d21 d22 1
d21 1 1 c0t2d0s1
d22 1 1 c0t3d0s1
d30 -m d31 d32 1
d31 1 1 c0t2d0s7
d32 1 1 c0t3d0s7
# metadb -i
flags first blk block count
a m p luo 16 1034 /dev/dsk/c0t2d0s4
a p luo 1050 1034 /dev/dsk/c0t2d0s4
a p luo 2084 1034 /dev/dsk/c0t2d0s4
a p luo 16 1034 /dev/dsk/c0t2d0s5
a p luo 1050 1034 /dev/dsk/c0t2d0s5
a p luo 2084 1034 /dev/dsk/c0t2d0s5
a p luo 16 1034 /dev/dsk/c0t3d0s4
a p luo 1050 1034 /dev/dsk/c0t3d0s4
a p luo 2084 1034 /dev/dsk/c0t3d0s4
a p luo 16 1034 /dev/dsk/c0t3d0s5
a p luo 1050 1034 /dev/dsk/c0t3d0s5
a p luo 2084 1034 /dev/dsk/c0t3d0s5
# format / partition / print
c0t2d0
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 3560 8.00GB (3561/0/0) 16779432
1 swap wu 3561 - 4451 2.00GB (891/0/0) 4198392
2 backup wm 0 - 7505 16.86GB (7506/0/0) 35368272
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 4452 - 4456 11.50MB (5/0/0) 23560
5 unassigned wm 4457 - 4461 11.50MB (5/0/0) 23560
6 unassigned wm 0 0 (0/0/0) 0
7 unassigned wm 4462 - 7505 6.84GB (3044/0/0) 14343328
c0t3d0
Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 3560 8.00GB (3561/0/0) 16779432
1 swap wu 3561 - 4451 2.00GB (891/0/0) 4198392
2 backup wm 0 - 7505 16.86GB (7506/0/0) 35368272
3 unassigned wu 0 0 (0/0/0) 0
4 unassigned wm 4452 - 4456 11.50MB (5/0/0) 23560
5 unassigned wm 4457 - 4461 11.50MB (5/0/0) 23560
6 unassigned wu 0 0 (0/0/0) 0
7 unassigned wm 4462 - 7505 6.84GB (3044/0/0) 14343328
ok printenv boot-device
boot-device=disk2 disk3
ok devalias
disk2 /pci@1f,0/pci@1,1/scsi@2/sd@2,0:a
disk3 /pci@1f,0/pci@1,1/scsi@2/sd@3,0:a
SUBMITTER: Stephane Dutilleul
APPLIES TO: Hardware/Sun Fire /15000