| SRDB ID | Synopsis | Date |
| 49205 | Sun Fire[TM] 12K/15K: Dstop: Slot1 IO Data Valid phase error | 3 Dec 2002 |
| Status | Issued |
| Description |
- Problem Statement/Title: SF15K Troubleshooting Article:
Dstop: Slot1 IO Data Valid phase error (SDI)
- Symptoms:
'wfail' output reports something similar to the following:
01 redxl> dumpf load dsmd.dstop.021125.1132.46
02 Created Mon Nov 25 11:32:47 2002
03 By hpost v. 1.3 sms1.3_14 Nov 14 2002 15:19:45 executing as pid=5395
04 On ssc name = xc15p13-sc1.SD_Lab.West.Sun.COM
05 Domain = 2=C = b1 Platform = sun15
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 000A0
08 Slot0[17:0]: 000A0
09 Slot1[17:0]: 000A0
10 -D option, -d
11 "DSMD DomainStop Dump"
12 Created in a Sun Microsystems Inc. internal environment.
13 7 errors occurred while creating this dump.
14 redxl> wfail
15 SDI EX05/S0 Master_Stop_Status0[31:0] = E004000F
16 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
17 SDI EX05/S0 Dstop0[31:0] = 01018100
18 Dstop0[16]: D DARB texp requests all Dstop (M)
19 Dstop0[24]: D 1E SDI internal Slot1 port requested Dstop
20 SDI EX05/S0 Slot1_Error0[31:0] = 4000C000 Mask = 3000FFFF
21 S1Err0[30]: D 1E Slot1 IO Data Valid phase error (M)
22 FAIL Slot IO5: Dstop/Rstop detected by SDI
23 Primary service FRU is Slot IO5.
24 Secondary service FRU is EXB EX5.
25 SDI EX07/S0: All SDI is DStopped and RStopped, requested by DARB.
26 DARB C0: enabled ports (expanders) [17:0]: 3FDF1
27 DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00020
28 DARB C1: enabled ports (expanders) [17:0]: 3FDF1
29 DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00020
SOLUTION SUMMARY:
- Troubleshooting:
The dump header tells us that this Dstop was generated by dsmd (lines 10,11)
while a domain was active. This is also evident from the dumpf file name:
dsmd.dstop files are created by dsmd as part of an ASR. Also note that errors
occurred while creating the dump (line 13). This typically indicates that
a component was not available while register information was being collected.
Walking the error chain:
- Master SDI on EX5 is directed to Dstop by itself (lines 18,19)
- Master SDI on EX5 reports a Data Valid phase error (line 21)
- IO5 is FAILed from the configuration (line 22)
- IO5 and EX5 are named primary and secondary FRUs (lines 23,24)
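The bit positions in the chain above can be cross-checked by hand against the
raw register values in the 'wfail' output. The following is a minimal Python
sketch, not part of redx or SMS. The helper names set_bits and unmasked_bits
are hypothetical, and the assumption that a 1 in the Mask field suppresses the
matching error bit is inferred only from this one listing.

```python
def set_bits(value, width=32):
    """Return the positions of the bits set in value, highest first."""
    return [bit for bit in range(width - 1, -1, -1) if (value >> bit) & 1]


def unmasked_bits(value, mask, width=32):
    """Return set bits of value not covered by mask.

    Assumption: a mask bit of 1 means the error bit is suppressed.
    """
    return set_bits(value & ~mask & ((1 << width) - 1), width)


# Register values copied from lines 17, 20 and 27 of the example dump.
dstop0 = 0x01018100
slot1_error0 = 0x4000C000
slot1_mask = 0x3000FFFF
darb_req_exps = 0x00020  # 18-bit expander mask

print(set_bits(dstop0))
print(unmasked_bits(slot1_error0, slot1_mask))
print(set_bits(darb_req_exps, width=18))
```

For this dump the sketch yields bits 24, 16, 15 and 8 for Dstop0 ('wfail'
annotates only 16 and 24; this article does not describe the other two),
bit 30 for Slot1_Error0 once the mask is applied, and expander 5 for the
DARB request mask.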
As part of its normal operation, the SDI compares its input clock with
that of its Slot 1 board. Specifically, the SDC on the Slot 1 board
produces a signal (dataid_vld_l) that is the inverse of the input clock
signal. If the SDI detects that the inverse signal is no longer in phase
with the input signal, an error occurs.
IO5 and EX5 are named as suspect FRUs because the dataid_vld_l signal
crosses an interconnect between them. IO5 is the one FAILed because
removing it has less overall impact on the domain/system.
- Resolution:
In general, a lone "Data Valid phase error" is an indicator that the Slot 1
board is at fault. However, if multiple components report clocking errors,
further analysis is required (see SRDB 48293).
In addition to the dump file, investigate the following to gather additional
evidence that may point to a specific FRU.
- Were any clock input failures reported? (Assumes SMS 1.2 with
patch 112481-06, or higher)
- Did a failover occur prior to the Dstop? When an SC becomes
MAIN, it will attempt to migrate clocks to the MAIN.
- Was the IO board powered off?
- Was an SC powered off?
Any of the above can account for an interruption in the clock. If running
an older/unpatched version of SMS, executing 'showclocksrc' (downloadable
from CPRE) may also reveal bad inputs.
- Summary of part numbers and patch IDs
http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html
Patch ID 112481-06
- References and bug IDs
SRDB 48122
SRDB 48293
http://pts-americas.west.sun.com/esg/hsg/starcat/tools/showclocksrc.html
- Additional background information:
In this example, the dump header shows errors during dump creation (line 13).
By trying various "show" commands in 'redx', you may determine which
component(s) resulted in errors. A good starting point is the suspect
parts called out by 'wfail'. For this example:
30 redxl> shar 5 1
31 Note: Data is displayed from the currently loaded dump file.
32 AR IO5 (5.1) Component ID = DEAD2BBC
33 NOTE: 3 errors in the process of creating this structure
34 redxl> shsdc 5 1
35 Note: Data is displayed from the currently loaded dump file.
36 SDC IO05 Component ID = DEAD2BBC
37 NOTE: 3 errors in the process of creating this structure
38 redxl> shdx 5 1
39 Note: Data is displayed from the currently loaded dump file.
40 DXs IO5 Component ID = DEADBAD0
41 .0 Dev_Temp[8:0] = 000: Valid 0.51 DegC
42 .1 Dev_Temp[8:0] = 000: Valid 0.51 DegC
43 DX asics on board with non-0 error status [1:0] = 0
Nothing on IO5 is readable. Those ASICs directly accessible via the console
bus (AR, SDC, DX) report either DEAD2BBC or DEADBAD0. This indicates
a power issue. Also, from the platform message log:
Nov 25 11:21:27 2002 xc15p13-sc1 esmd[27845]: [2000 343712253731524 ERR
SysControl.cc 2635] A power failure has been detected on a redundant power
supply at +1.5_vdc1_ok; located on HPCI+ at IO15. SCHEDULE REPLACEMENT of
HPCI+ at IO15 as soon as possible. If an additional failure occurs on this
supply it may crash any dependent domain(s).
For this example, the HPCI+ board was experiencing power issues. IO5 is the
target FRU.
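When several "show" commands have been captured to a text file, the check for
unreadable ASICs can be scripted. The following is a minimal Python sketch,
not a redx feature. The function name unreadable_components and the regular
expression are hypothetical, and the only marker values it knows are the two
seen in this example (DEAD2BBC and DEADBAD0).

```python
import re

# Component ID values that this article's example associates with
# ASICs that could not be read. Other marker values may exist.
UNREADABLE_IDS = {"DEAD2BBC", "DEADBAD0"}

# Matches lines such as "AR IO5 (5.1) Component ID = DEAD2BBC".
PATTERN = re.compile(
    r"(?P<asic>\S+)\s+(?P<loc>IO\d+)\b.*Component ID = (?P<id>[0-9A-Fa-f]{8})"
)


def unreadable_components(redx_output):
    """Return (asic, location, id) for each component with a marker ID."""
    hits = []
    for line in redx_output.splitlines():
        match = PATTERN.search(line)
        if match and match.group("id").upper() in UNREADABLE_IDS:
            hits.append(
                (match.group("asic"), match.group("loc"), match.group("id").upper())
            )
    return hits


# Lines 32, 36 and 40 of the example above.
sample = """\
32 AR IO5 (5.1) Component ID = DEAD2BBC
36 SDC IO05 Component ID = DEAD2BBC
40 DXs IO5 Component ID = DEADBAD0
"""

for asic, loc, comp_id in unreadable_components(sample):
    print(f"{asic} on {loc} is unreadable (Component ID = {comp_id})")
```

If every ASIC on one board is flagged, as with IO5 here, check the platform
message log for power failures on that board before replacing parts.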
Keywords Section
-------------------
15K, 12K, SF15K, SF12K, starcat, dstop, Slot1 IO Data Valid phase error
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
PATCH ID: 112481-06
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: