| SRDB ID | Synopsis | Date |
| 48125 | Sun Fire[TM] 12K/15K: Rstop: Slot0 asserted EccErr, enabled to cause Rstop | 29 Oct 2002 |
| Status | Issued |
| Description |
- Problem Statement:
Rstop: Slot0 asserted EccErr, enabled to cause Rstop
- Symptoms:
'wfail' output reports something similar to the following:
01 redxl> dumpf load xcstate.020213.2331.22
02 Created Wed Feb 13 23:31:22 2002
03 By hpost v. 1.1 Generic 112099-05 Nov 27 2001 12:41:09 executing as pid=892
04 On ssc name = n017.new-sc1.
05 Domain = 0=A Platform = n017.new
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 0007F
08 Slot0[17:0]: 0007F
09 Slot1[17:0]: 0003F
10 Stop on EXB EX0 during stage cpu_lpost_II
11 0 errors occurred while creating this dump.
12 redxl> wfail
13 SDI EX00/S0: SDI is RStopped, requested by DARB.
14 SDI EX01/S0: SDI is RStopped, requested by DARB.
15 SDI EX02/S0: SDI is RStopped, requested by DARB.
16 SDI EX03/S0: SDI is RStopped, requested by DARB.
17 SDI EX04/S0: SDI is RStopped, requested by DARB.
18 SDI EX05/S0 Master_Stop_Status0[31:0] = 80040308
19 MStop0[3]: SDI is Recordstopped
20 SDI EX05/S0 Recordstop0[31:0] = 04018400
21 Rstop0[16]: R DARB texp request Recordstop (M)
22 Rstop0[26]: R 1E Slot0 asserted EccErr, enabled to cause Rstop (M)
23 EPLD SB05 Ecc_Err: Mask= F7 Err= 08 SDC reports EccErr
24 SDC SB05 EccStatus[31:0] = 0000E073
25 EccSt[15]: Safari port 0/1 Ecc error logged.
26 Received by DXs from local Safari port 1, read operation.
27 DX SB05/DX2 Ecc_Syndrome[31:0] = 00000121
28 Syndr[ 8: 0]: P01 Data: 121: CE bit 97
29 Syndr[ 15]: P01 Direction: 0: Safari port to DX (Incoming)
30 ECC correctable errors detected from Processor Port SB5/P1, no corresponding
31 parity error in DXs or DCDSs.
32 Assuming the error originated in memory on this port.
33 Data syndrome 121 is CE bit 97.
34 This bit is in one of Dimm SB5/P1/B0/D3 or Dimm SB5/P1/B1/D3.
35 Bank/Dimm fault attribution for data CEs is the responsibility of
36 lpost or domain software which has address information that
37 allows error attribution to a bank. No action taken here.
38 SDI EX06/S0: SDI is RStopped, requested by DARB.
39 DARB C0: enabled ports (expanders) [17:0]: 0007F
40 DARB C0: exps request Rstop [17:0]: 00020
41 DARB C0: other darb req Rstop for exps [17:0]: 00020
42 DARB C1: enabled ports (expanders) [17:0]: 0007F
43 DARB C1: exps request Rstop [17:0]: 00020
44 DARB C1: other darb req Rstop for exps [17:0]: 00020
SOLUTION SUMMARY:
- Troubleshooting:
The dump header tells us that this error was encountered during the cpu_lpost_II
stage of POST (line 10). This is also evident from the dump file name: xcstate
files are created when POST detects a stop condition. In fact, the output in
the POST log will look like the above, except for the dump header. Walking the
error chain:
- SDI5 reports, as its first error, that its Slot 0 board asserted an ECC error
(line 22). This equates to SB5.
- Next, the EPLD on SB5 is examined, and we see it's reporting an error from the
SDC (line 23).
- Continuing, the SDC's error registers reveal an ECC error on a DX
(lines 24-26). Note also that the SDC identifies the erring operation as a
read operation involving Safari Port 1 (line 26). For the SDC, Safari
Port 1 is SB5/P1.
- Finally, the DX shows the syndrome and errored bit (lines 27-28). The direction
is also captured by the DX (line 29).
wfail then examines the DXs and DCDSs for any corresponding parity errors that
are pertinent to the failure and reports its findings (lines 30-32). In this case,
no pertinent parity errors are found, so the error is assumed to originate
in memory. Finally, wfail narrows the fault down to one of two DIMMs, but states
that either Solaris[TM] or LPOST must identify the exact DIMM (lines 34-37).
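The register values in the wfail output can be cross-checked by hand. The
following Python sketch is an illustration only, not part of redx or wfail;
the helper names set_bits and field are hypothetical. It assumes that bit 0 is
the least significant bit, which agrees with the labels wfail prints (for
example, Rstop0[26] for a Recordstop0 value of 04018400).

```python
def set_bits(value, width=32):
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


def field(value, hi, lo):
    """Extract bits [hi:lo] of value, matching labels such as Syndr[8:0]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


# Values copied from the wfail output above (hypothetical variable names).
enabled_expanders = 0x0007F   # line 39: DARB C0 enabled ports [17:0]
rstop_requesters = 0x00020    # line 40: DARB C0 exps request Rstop [17:0]
recordstop0 = 0x04018400      # line 20: SDI EX05/S0 Recordstop0[31:0]
ecc_syndrome = 0x00000121     # line 27: DX SB05/DX2 Ecc_Syndrome[31:0]

print(set_bits(enabled_expanders, 18))   # [0, 1, 2, 3, 4, 5, 6]
print(set_bits(rstop_requesters, 18))    # [5]
print(26 in set_bits(recordstop0))       # True
print(hex(field(ecc_syndrome, 8, 0)))    # 0x121
print(field(ecc_syndrome, 15, 15))       # 0
```

The single bit set in the Rstop request mask is bit 5, which matches EX05 being
the only expander whose SDI reports its own Recordstop rather than one
requested by the DARB.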
- Resolution:
If the stop was encountered during POST (as above), examine the POST log for
an error that identifies the exact DIMM. Example message:
Primary service FRU is Dimm SB5/P1/B0/D3
If the stop was encountered while Solaris was running, consult the /var/adm/messages
logs on the domain for the faulty DIMM. Example message:
SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event on CPU96 at TL=0, errID 0x000122cf.377e4dcb
AFSR 0x00000002<CE>.00000052 AFAR 0x00000060.64a9b870
Fault_PC 0x10024d20 Esynd 0x0052 SB3/P1/B0/D1 J14400
SUNW,UltraSPARC-III+: [AFT0] errID 0x000122cf.377e4dcb Corrected Memory Error on SB3/P1/B0/D1 J14400 is Persistent
SUNW,UltraSPARC-III+: [AFT0] errID 0x000122cf.377e4dcb Data Bit 28 was in error and corrected
Note that wfail assumes that the error originated in memory. However, it is
possible that the data was written into memory with bad ECC. The failure history
of the system can be taken into account to troubleshoot further. A brief
discussion of a "bad writer" is in the Sun Fire 15K Architecture document.
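The log checks described above can be scripted. The following Python sketch is
an illustration only: the function name find_dimms and the regular expressions
are assumptions based solely on the two example messages shown in this
article, and real POST or /var/adm/messages output may be worded differently.

```python
import re

# DIMM locator in the form used in this article, e.g. SB5/P1/B0/D3.
DIMM = r"SB\d+/P\d+/B\d+/D\d+"
# Patterns derived only from the example messages above (assumption).
PATTERNS = [
    re.compile(r"Primary service FRU is Dimm (" + DIMM + r")"),
    re.compile(r"Corrected Memory Error on (" + DIMM + r")"),
]


def find_dimms(lines):
    """Return the DIMMs named in the given log lines, in order, no repeats."""
    found = []
    for line in lines:
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match and match.group(1) not in found:
                found.append(match.group(1))
    return found


# The two candidates wfail reported on line 34 of the output above.
candidates = {"SB5/P1/B0/D3", "SB5/P1/B1/D3"}

post_log = ["Primary service FRU is Dimm SB5/P1/B0/D3"]
for dimm in find_dimms(post_log):
    # Confirm that the DIMM named in the log is one wfail pointed at.
    print(dimm, dimm in candidates)   # SB5/P1/B0/D3 True
```

A DIMM found in the logs that is not one of the wfail candidates (the Solaris
example above names SB3/P1/B0/D1, which belongs to a different board) most
likely refers to a separate event and should be investigated on its own.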
- Summary of part numbers and patch IDs
- References and bug IDs
SunSolve Article 48122
http://webhome.eng.sun.com/alanc/sunfire/arch/sf15Karch.pdf
- Additional background information:
On occasion, Rstops have been encountered during LPOST where a DIMM is not
identified by LPOST. One theory is that a memory access beyond the visibility of
LPOST encountered the ECC error; for example, when the AXQ accesses memory to
fetch MTags.
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords
15K, 12K, SF15K, SF12K, starcat, rstop, Slot0 asserted EccErr, enabled to cause Rstop
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: