| SRDB ID | Synopsis | Date |
| 48233 | Sun Fire[TM] 12K/15K: Rstop: Mtag CEs Corrected by SDI (M) | 31 Oct 2002 |
| Status | Issued |
| Description |
- Problem Statement:
Rstop: Mtag CEs Corrected by SDI (M).
- Symptoms:
The output of the redx wfail, shsdc, shdx and shsdi commands reports the
following failure signature:
01 redxl> dumpf load dsmd.rstop.020927.0951.39
02 Created Fri Sep 27 09:51:39 2002
03 By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15 executing as pid=18215
04 On ssc name = swmtftk01.
05 Domain = 0=A = MES Platform = f15k4
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 00007
08 Slot0[17:0]: 00007
09 Slot1[17:0]: 00007
10 -D option, -d
11 "DSMD RecordStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX00/S0: SDI is RStopped, requested by DARB.
15 SDI EX01/S0 Master_Stop_Status0[31:0] = 70000308
16 MStop0[3]: SDI is Recordstopped
17 SDI EX01/S0 Recordstop0[31:0] = 04218020
18 Rstop0[16]: R DARB texp request Recordstop (M)
19 Rstop0[21]: R 1E SDI internal Slot0 port request Recordstop
20 Rstop0[26]: R Slot0 asserted EccErr, enabled to cause Rstop (M)
21 SDI EX01/S0 Slot0_Error1[31:0] = 00408040 Mask = 31444EBF
22 S0Err1[22]: R 1E Slot0 quadword 0 Mtag correctable ECC error (M)
23 {q0_mtag[2:0],q0_mtag_ecc[3:0]} = 02. Calc'd Syndrome[3:0] = 2
24 CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
25 {q1_mtag[2:0],q1_mtag_ecc[3:0]} = 00. Calc'd Syndrome[3:0] = 0
26 DX SB1/DX2 reports from-port MTAG ECC syndrome 2 (port 0/1) matching
27 one of the MTAG syndromes recorded by the SDI.
28 SDI detected an MTAG ECC error from Slot SB1, and a non-1st EccErr asserted
29 from the same slot. There is also a from-port MTAG ecc error
30 detected by a DX on this slot, with the same syndrome.
31 Analysis will assume the actual first error is that recorded
32 by the DX, for better fault identification.
33 EPLD SB01 Ecc_Err: Mask= F7 Err= 08 SDC reports EccErr
34 SDC SB01 EccStatus[31:0] = 0000E041
35 EccSt[15]: Safari port 0/1 Ecc error logged.
36 Received by DXs from local Safari port 1, read operation.
37 DX SB01/DX2 Ecc_Syndrome[31:0] = 00000800
38 Syndr[13:10]: P01 Mtag: 2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
39 Syndr[ 15]: P01 Direction: 0: Safari port to DX (Incoming)
40 ECC correctable errors detected from Processor Port SB1/P1, no corresponding
41 parity error in DXs or DCDSs.
42 Assuming the error originated in memory on this port.
43 Mtag syndrome 2 is CE check bit 1 (bit 4 of [6:0], 141 of [143:0]).
44 This bit is in one of Dimm SB1/P1/B0/D2 or Dimm SB1/P1/B1/D2.
45 MTag CEs should be corrected by the SDIs, so there will be no resulting
46 processor trap with address information that would permit
47 bank attribution. Error isolation to the bank is not possible.
48 We must therefore FAIL all memory on this port.
49 FAIL All memory on Port SB1/P1: Rstop detected by DXs/SDC.
50 Primary service FRU is All memory on Port SB1/P1.
51 Secondary service FRU is Slot SB1.
52 SDI EX02/S0: SDI is RStopped, requested by DARB.
53 DARB C0: enabled ports (expanders) [17:0]: 03E3F
54 DARB C0: exps request Rstop [17:0]: 00002
55 DARB C0: other darb req Rstop for exps [17:0]: 00002
56 DARB C1: enabled ports (expanders) [17:0]: 03E3F
57 DARB C1: exps request Rstop [17:0]: 00002
58 DARB C1: other darb req Rstop for exps [17:0]: 00002
59 redxl> shsdc -e 1 0
60 Note: Data is displayed from the currently loaded dump file.
61 SDC SB01 Component ID = 416C107D
62 Lockstep_Err[19:0] = 00000
63 L2_Check__Err[23:0] = 000000
64 EccStatus[31:0] = 0000E041
65 EccSt[15]: Safari port 0/1 Ecc error logged.
66 Received by DXs from local Safari port 1, read operation.
67 No Safari port 2/3 Ecc error logged.
68 Enabled SDC ports [9:0] = 01F
69 Non-0 port errregs [9:0] = 000
70 redxl> shdx -e 1 0 2
71 Note: Data is displayed from the currently loaded dump file.
72 DX SB1/DX2 Component ID = 416C307D
73 Gen_Err[31:0] = 00000000
74 Ecc_Syndrome[31:0] = 00000800
75 Syndr[13:10]: P01 Mtag: 2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
76 Syndr[ 15]: P01 Direction: 0: Safari port to DX (Incoming)
77 Enabled DX ports[9:0] = 03F
78 PortErr is non-0[9:0] = 000
79 redxl> shsdi -e 1
80 Note: Data is displayed from the currently loaded dump file.
81 SDI EX01/S0 Component ID = 64317049
82 Master_Stop_Status0[31:0] = 70000308
83 MStop0[3]: SDI is Recordstopped
84 Master_Stop_Status1[31:0] = 8181000D
85 0x01 CP1StopExp[4:0] MSS1[20:16]
86 0 CP1StopSlot[0:1] MSS1[22:21] Rstop is 1st stop
87 1 CP1StopInfoValid MSS1[23]
88 0x01 CP0StopExp[4:0] MSS1[28:24]
89 0 CP0StopSlot[0:1] MSS1[30:29] Rstop is 1st stop
90 1 CP0StopInfoValid MSS1[31]
91 Dstop0[31:0] = 00000000
92 Dstop1[31:0] = 00000000
93 Recordstop0[31:0] = 04218020
94 Rstop0[16]: R DARB texp request Recordstop (M)
95 Rstop0[21]: R 1E SDI internal Slot0 port request Recordstop
96 Rstop0[26]: R Slot0 asserted EccErr, enabled to cause Rstop (M)
97 Recordstop1[31:0] = 00000000
98 Core_Error0[31:0] = 00000000 Mask = 0051FFFF
99 Core_Error1[31:0] = 00000000 Mask = FFFFFFFF
100 Sysreg_Error[31:0] = 00000000 Mask = 780377FF
101 STB_Error[31:0] = 00000000 Mask = 7F00FFFF
102 CP_Error0[31:0] = 00000000 Mask = 580067FF
103 CP_Error1[31:0] = 00000000 Mask = 7FFCFFFF
104 Slot0_Error0[31:0] = 00000000 Mask = 7000FFFF
105 Slot0_Error1[31:0] = 00408040 Mask = 31444EBF
106 S0Err1[22]: R 1E Slot0 quadword 0 Mtag correctable ECC error (M)
107 {q0_mtag[2:0],q0_mtag_ecc[3:0]} = 02. Calc'd Syndrome[3:0] = 2
108 CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
109 {q1_mtag[2:0],q1_mtag_ecc[3:0]} = 00. Calc'd Syndrome[3:0] = 0
110 Slot0_ErrData[4:2][31:0] = 001FBC00 000C0000 00000000
111 Slot0_ErrData[1:0][31:0] = 00080000 00000000
112 Slot0_Error2[31:0] = 00000000 Mask = 7FFCFFFF
113 Slot1_Error0[31:0] = 00000000 Mask = 3000FFFF
114 Slot1_Error1[31:0] = 00000000 Mask = 31404EBF
115 Slot1_Error2[31:0] = 00000000 Mask = 7FFCFFFF
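The decoded fields in the dump above can be cross-checked by hand from the raw register values. The following is a minimal sketch, not part of redx: the helper names `field` and `set_bits` are hypothetical, and the bit ranges are taken only from the labels that redx prints (for example Syndr[13:10] and MSS1[20:16]).

```python
def field(value, hi, lo):
    """Return bits [hi:lo] of a 32-bit register value as an integer."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def set_bits(value, width):
    """Return the positions of the bits that are set in a [width-1:0] mask."""
    return [bit for bit in range(width) if (value >> bit) & 1]


# DX SB01/DX2 Ecc_Syndrome[31:0] = 00000800 (dump lines 37-39)
ecc_syndrome = 0x00000800
mtag_syndrome = field(ecc_syndrome, 13, 10)   # 2: CE check bit 1
direction = field(ecc_syndrome, 15, 15)       # 0: Safari port to DX (Incoming)

# SDI EX01/S0 Master_Stop_Status1[31:0] = 8181000D (dump lines 84-90)
mss1 = 0x8181000D
cp1_stop_exp = field(mss1, 20, 16)            # 0x01
cp1_info_valid = field(mss1, 23, 23)          # 1

# DARB C0: exps request Rstop [17:0]: 00002 (dump line 54)
requesting_expanders = set_bits(0x00002, 18)  # [1], i.e. expander 1
```

The same two helpers apply to any other register in the dump whose field boundaries redx labels explicitly.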
SOLUTION SUMMARY:
- Troubleshooting:
The dump header shows that this Rstop dump file was generated by dsmd while the domain was running (lines 10, 11).
Walking the error chain:
  - The master SDI on EX1 reports a Slot 0 Mtag ECC error (lines 22-25).
  - The SDC reports an error on Safari port 0/1 (lines 34-36).
  - The DX reports an ECC error from Safari port 1 (lines 37-39).
  - The suspect DIMMs are identified (lines 42-44).
  - Since 'wfail' cannot identify an individual DIMM, all memory on the processor port controlling it is FAILed (line 49).
  - The FRUs are named as the memory of SB1/P1 (primary) and the system board itself (secondary) (lines 50-51).
The DXs work in pairs (0&2, 1&3) to compute the ECC of each of the two quadwords. On DX2, the data is incoming from the Safari port and the computed Mtag ECC syndrome is 2, which is CE check bit 1 (bit 141 of [143:0]). With no corresponding parity error in the DXs or DCDSs, we can assume that the bit error originates from the memory of that processor port (SB1/P1). This bit would be in one of Dimm SB1/P1/B0/D2 or Dimm SB1/P1/B1/D2.
There is no bank attribution because the Mtag CE has been corrected by the SDI(M), so no processor ever sees it. As a result, no processor trap is taken with address information that would permit bank attribution. The exception is an access from local memory, since in that case the data would not pass through the SDI(M). Other than this Rstop dump, there would be no console or domain messages indicating that the Mtag CE event occurred.
- Resolution:
Treat an Mtag CE according to the same best practices as a memory CE. Do not replace DIMM(s) on the first occurrence.
- Summary of part number and patch ID's
- References and bug IDs
Safari Specification/Starcat Architecture.
- Additional background information:
Safari Data Structure
--------------------
<127--Data--0><--8ECC--0><2--Mtag--0><3--MtagECC--0>
<-------------------------------------------------->
                      144-bit
Mtag Syndrome Table
-------------------
Mtag ECC syndrome 7: CE bit 0 (bit 137 of [143:0])
Mtag ECC syndrome B: CE bit 1 (bit 138 of [143:0])
Mtag ECC syndrome D: CE bit 2 (bit 139 of [143:0])
Mtag ECC syndrome 1: CE check bit 0 (bit 3 of [6:0], 140 of [143:0])
Mtag ECC syndrome 2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
Mtag ECC syndrome 4: CE check bit 2 (bit 5 of [6:0], 142 of [143:0])
Mtag ECC syndrome 8: CE check bit 3 (bit 6 of [6:0], 143 of [143:0])
Multiple-bit syndromes are anything else except 0 (since 0 indicates no error).
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords
15K, 12K, SF15K, SF12K, starcat, rstop, SDI(M), Mtag correctable ECC error.
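The Mtag syndrome table in the background information can be written as a simple lookup. This is only an illustrative sketch: the name `decode_mtag_syndrome` and the returned strings are assumptions made for this example, while the syndrome values and bit positions are transcribed from that table.

```python
# Single-bit Mtag ECC syndromes: syndrome -> (description, bit of [143:0]).
MTAG_SYNDROMES = {
    0x7: ("CE bit 0", 137),
    0xB: ("CE bit 1", 138),
    0xD: ("CE bit 2", 139),
    0x1: ("CE check bit 0", 140),
    0x2: ("CE check bit 1", 141),
    0x4: ("CE check bit 2", 142),
    0x8: ("CE check bit 3", 143),
}


def decode_mtag_syndrome(syndrome):
    """Classify a 4-bit Mtag ECC syndrome according to the table."""
    if not 0 <= syndrome <= 0xF:
        raise ValueError("Mtag ECC syndrome must fit in 4 bits")
    if syndrome == 0:
        return "no error"
    entry = MTAG_SYNDROMES.get(syndrome)
    if entry is None:
        # Any other non-zero value is a multiple-bit syndrome.
        return "multiple-bit error"
    name, bit = entry
    return f"{name} (bit {bit} of [143:0])"
```

For the syndrome in this dump, `decode_mtag_syndrome(0x2)` returns "CE check bit 1 (bit 141 of [143:0])", which agrees with dump line 43.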
INTERNAL SUMMARY:
SUBMITTER: Tong-Pheng Koh
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: