| SRDB ID | Synopsis | Date |
| 48124 | Sun Fire[TM] 12K/15K: Rstop: Slot1 data parity bit 1 error | 30 Oct 2002 |
| Status | Issued |
| Description |
- Problem Statement:
Rstop: Slot1 data parity bit 1 error
- Symptoms:
'wfail' output reports something similar to the following:
01 redxl> dumpf load dsmd.rstop.020915.1751.52
02 Created Sun Sep 15 17:51:53 2002
03 By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15 executing as pid=13269
04 On ssc name = starcat1-sc0.bestbuy.com
05 Domain = 5=F = ds01ux Platform = starcat1
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 0E000
08 Slot0[17:0]: 0E000
09 Slot1[17:0]: 0E000
10 -D option, -d
11 "DSMD RecordStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX13/S0: SDI is RStopped, requested by DARB.
15 SDI EX14/S0: SDI is RStopped, requested by DARB.
16 SDI EX14/S0 Recordstop0[31:0] = 04018001
17 Rstop0[16]: R 1E DARB texp request Recordstop (M)
18 Rstop0[26]: R Slot0 asserted EccErr, enabled to cause Rstop (M)
19 SDI EX15/S0 Master_Stop_Status0[31:0] = 80040008
20 MStop0[3]: SDI is Recordstopped
21 SDI EX15/S0 Recordstop0[31:0] = 00010001
22 Rstop0[16]: R DARB texp request Recordstop (M)
23 SDI EX15/S0 Recordstop1[31:0] = 00408040
24 Rstop1[22]: R 1E SDI Slave 1 requested all Recordstop
25 SDI EX15/S1 Master_Stop_Status0[31:0] = 00000008
26 MStop0[3]: SDI is Recordstopped
27 SDI EX15/S1 Recordstop0[31:0] = 00408040
28 Rstop0[22]: R 1E SDI internal Slot1 port request Recordstop
29 SDI EX15/S1 Slot1_Error1[31:0] = 2000A000 Mask = FFFF4FFF
30 S1Err1[29]: R 1E Slot1 data parity bit 1 error
31 slt1_datap[1:0], slt1_data[23:0] = 3 000020
32 FAIL Slot IO15: Dstop/Rstop detected by SDI EX15/S1
33 Primary service FRU is Slot IO15.
34 Secondary service FRU is EXB EX15.
35 DARB C0: enabled ports (expanders) [17:0]: 0EDFF
36 DARB C0: exps request Rstop [17:0]: 08000
37 DARB C0: other darb req Rstop for exps [17:0]: 08000
38 DARB C1: enabled ports (expanders) [17:0]: 0EDFF
39 DARB C1: exps request Rstop [17:0]: 08000
40 DARB C1: other darb req Rstop for exps [17:0]: 08000
SOLUTION SUMMARY:
- Troubleshooting:
The dump header shows that this Rstop was generated by dsmd (lines 10, 11) while
a domain was active. The dump file name indicates the same thing: dsmd.rstop files are
created by dsmd as part of error capture. Walking the error chain:
- EX14 shows errors, but the first error is the DARB requesting the stop (lines 16-18).
- EX15/S0 (SDI0) reports a first error from slave SDI1 (line 24).
- EX15/S1 (SDI1) reports a first error of a parity bit 1 error from slot 1 (line 30).
- 'wfail' then recommends failing IO15 (line 32) to avoid the error.
- IO15 and EX15 are identified as the primary and secondary FRUs (lines 33, 34).
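The bit numbers that 'wfail' prints in brackets (for example Rstop0[16] or S1Err1[29]) can be checked against the raw register values by listing which bits are set. The following is a minimal Python sketch of that check, not part of redx. The helper name set_bits is hypothetical, and the treatment of the Mask value (only bits set in both the register and the mask are decoded) is an assumption that happens to match this dump.

```python
def set_bits(value, width=32):
    """Return the positions of the bits that are 1 in value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]

# Register values copied from the 'wfail' output above.
registers = {
    "EX14/S0 Recordstop0": 0x04018001,
    "EX15/S0 Recordstop1": 0x00408040,
    "EX15/S1 Recordstop0": 0x00408040,
}

for name, value in registers.items():
    print(f"{name} = {value:08X} -> bits {set_bits(value)}")

# Slot1_Error1 is printed together with a Mask value.
# ASSUMPTION: only bits set in both the register and the mask are decoded.
slot1_error1 = 0x2000A000
mask = 0xFFFF4FFF
print(f"Slot1_Error1 & Mask -> bits {set_bits(slot1_error1 & mask)}")
```

The set bits include the ones 'wfail' describes (16 and 26 for EX14/S0, 22 for EX15, 29 for Slot1_Error1). Other bits are also set in the raw values; this article does not explain them, so the sketch only lists them.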
A parity error detected by EX15/SDI1 implies that a bit was corrupted while the
data was in transit between the DXs on IO15 and SDI1 on EX15. The error therefore happened
across an interconnect, and both IO15 and EX15 are suspect.
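As a cross-check on which expander is involved, the [17:0] fields in the dump header and in the DARB lines can be read as bitmasks. The sketch below assumes that bit n corresponds to expander n; that mapping is not stated in this article, but it is consistent with 'wfail' reporting SDIs on EX13, EX14 and EX15 for a "Boards in dump" value of 0E000. The function name is hypothetical.

```python
def expanders_in_mask(mask, width=18):
    """List expander numbers whose bit is set (ASSUMPTION: bit n = expander n)."""
    return [n for n in range(width) if (mask >> n) & 1]

# Values copied from the dump header and the DARB lines of the 'wfail' output.
print("Boards in dump:     ", expanders_in_mask(0x0E000))
print("Exps requesting Rstop:", expanders_in_mask(0x08000))
print("Enabled ports:      ", expanders_in_mask(0x0EDFF))
```

Under that assumption, only expander 15 requested the Rstop, which agrees with EX15 being named as a service FRU.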
Data path parity for the SDIs is only utilized as an interconnect diagnostic tool.
Because the underlying data is protected by ECC, and the individual SDIs only have
information on their specific data slice (i.e., multi-bit errors are unknown to a
single SDI), data is allowed to pass despite parity errors. However, the SDI records
the parity error in case further diagnosis is needed.
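The "record the error but let the data pass" behavior can be illustrated with a small Python sketch. This is a generic model only: it assumes a single even-parity bit computed over a whole 24-bit slice, whereas the actual parity coverage and polarity used by the SDI are not given in this article. All names are hypothetical.

```python
def even_parity(value):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(value).count("1") & 1

def receive_slice(data, parity, error_log):
    """Check parity, record a mismatch, and forward the data regardless."""
    if even_parity(data) != parity:
        error_log.append(f"parity error on slice {data:06X}")
    return data  # data is passed on; ECC further downstream handles correction

error_log = []
sent = 0x000020                  # illustrative 24-bit slice value
parity = even_parity(sent)       # computed by the sender
corrupted = sent ^ (1 << 3)      # one bit flipped in transit
forwarded = receive_slice(corrupted, parity, error_log)

print(f"forwarded {forwarded:06X}, log = {error_log}")
```

The receiver forwards the corrupted slice unchanged and only logs the mismatch, which mirrors the role of parity as a diagnostic aid rather than a data protection mechanism.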
Since the data is allowed to pass, Solaris[TM] will also record an ECC event. For example:
41 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 729643 kern.notice] NOTICE:
42 [AFT0] Corrected system bus (CE) Event on CPU448 at TL=0, errID 0x0000593b.a4ebc76b
43 Sep 15 17:51:52 ds01ux AFSR 0x00000002<CE>.000001b8 AFAR 0x00000400.fef171d0
44 Sep 15 17:51:52 ds01ux Fault_PC 0x103484dc Esynd 0x01b8 Not memory
45 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 317326 kern.notice] [AFT0] errID 0x0000593b.a4ebc76b
46 Data Bit 127 was in error and corrected
47 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 370641 kern.info] [AFT2] errID 0x0000593b.a4ebc76b
48 E$tag PA=0x000001e1.9af171c0 does not match AFAR=0x00000400.fef171c0
49 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 684557 kern.info] [AFT2] errID 0x0000593b.a4ebc76b
50 PA=0x000001e1.9af171c0
51 Sep 15 17:51:52 ds01ux E$tag 0x00000786.6b000001 E$state_7 Invalid
In this example, Solaris corrects the single bit error. The Rstop file can pinpoint where in
the interconnect the bit error occurred.
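To correlate the Solaris messages with the Rstop dump, it can help to pull the key fields out of the console log. The following Python sketch parses the example messages above with regular expressions. The function and field names are hypothetical. The CPU-to-expander calculation assumes 32 CPU IDs per expander, which is not stated in this article; under that assumption CPU448 maps to expander 14, which would agree with EX14/S0 asserting EccErr on line 18.

```python
import re

# Console messages from the example above, with the article's own
# line-number prefixes (41, 42, ...) removed.
LOG = """\
Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 729643 kern.notice] NOTICE:
[AFT0] Corrected system bus (CE) Event on CPU448 at TL=0, errID 0x0000593b.a4ebc76b
Sep 15 17:51:52 ds01ux AFSR 0x00000002<CE>.000001b8 AFAR 0x00000400.fef171d0
Sep 15 17:51:52 ds01ux Fault_PC 0x103484dc Esynd 0x01b8 Not memory
Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 317326 kern.notice] [AFT0] errID 0x0000593b.a4ebc76b
Data Bit 127 was in error and corrected
"""

def summarize_ce_event(log):
    """Extract event type, CPU, errID, syndrome and corrected data bit."""
    event = re.search(
        r"\[AFT0\] (.+?) Event on CPU(\d+) at TL=\d+, "
        r"errID (0x[0-9a-f]+\.[0-9a-f]+)",
        log,
    )
    if event is None:
        return None
    synd = re.search(r"Esynd (0x[0-9a-f]+)", log)
    bit = re.search(r"Data Bit (\d+) was in error", log)
    cpu = int(event.group(2))
    return {
        "event": event.group(1),
        "cpu": cpu,
        "err_id": event.group(3),
        "esynd": synd.group(1) if synd else None,
        "data_bit": int(bit.group(1)) if bit else None,
        # ASSUMPTION: 32 CPU IDs per expander; not confirmed by this article.
        "expander_guess": cpu // 32,
    }

summary = summarize_ce_event(LOG)
print(summary)
```

The errID and timestamp are the useful keys for matching a Solaris event to a dsmd.rstop file created at about the same time (here Sep 15 17:51:52 versus a dump created at 17:51:53).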
- Resolution:
First, judge the severity of the error. If the error was uncorrectable and
resulted in a domain interruption, a component replacement is in order. If the
error is correctable but recurs relatively often, a replacement may still be the best
way to avoid a future interruption.
If a replacement is warranted, follow the suggestion of 'wfail'. Start with the IO
board. During replacement, examine the interconnect for pin damage. If the IO board
exchange does not correct the problem, replace the Expander, which is the secondary FRU.
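The decision described above can be written down as a short Python sketch. It is illustrative only: the article does not define how often is "relatively often", so the threshold is a required, caller-supplied parameter, and the return strings and function name are hypothetical.

```python
# Replacement order suggested by 'wfail' in this example (lines 33, 34).
FRU_ORDER = ["Slot IO15", "EXB EX15"]

def recommend_action(uncorrectable, domain_interrupted,
                     correctable_count, repeat_threshold):
    """Return a suggested action based on severity and recurrence.

    repeat_threshold is a site-specific value (ASSUMPTION: the article
    gives no number for "relatively often").
    """
    if uncorrectable and domain_interrupted:
        return "replace"
    if correctable_count >= repeat_threshold:
        return "consider replacement"
    return "monitor"

action = recommend_action(
    uncorrectable=False,
    domain_interrupted=False,
    correctable_count=1,
    repeat_threshold=3,   # hypothetical value for illustration
)
print(action, "- first FRU if replacing:", FRU_ORDER[0])
```

For the single corrected event shown in this article, the sketch returns "monitor"; an uncorrectable error with a domain interruption would return "replace", starting with the first entry in FRU_ORDER.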
- Summary of part numbers and patch IDs:
501-5179 Expander
http://infoserver.central.sun.com/data/sshandbook/Devices/I_O/IO_SunFire_15K_hsPCI_IO_Board.html
- References and bug IDs
SunSolve Article 48122
- Additional background information:
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords
15K, 12K, SF15K, SF12K, starcat, rstop, Slot1 data parity bit 1 error
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: