| SRDB ID | Synopsis | Date |
| 48493 | Sun Fire[TM] 12K/15K: Dstop: CDC indicates an owner outside the domain | 1 Nov 2002 |
| Status | Issued |
| Description |
- Problem Statement:
Dstop: CDC indicates an owner outside the domain
- Symptoms:
'wfail' output reports something similar to the following:
01 redxl> dumpf load dsmd.dstop.020510.0947.08
02 Created Fri May 10 09:47:10 2002
03 By hpost v. 1.2 Generic 112488-04 Mar 18 2002 14:43:00 executing as pid=6825
04 On ssc name = rasputin-sc0.SD_RASCAL.West.Sun.COM
05 Domain = 0=A Platform = rasputin
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 12100
08 Slot0[17:0]: 12100
09 Slot1[17:0]: 12100
10 -D option, -d
11 "DSMD DomainStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX08/S0 Master_Stop_Status0[31:0] = E004000F
15 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
16 SDI EX08/S0 Dstop0[31:0] = 04218400
17 Dstop0[16]: D DARB texp requests all Dstop (M)
18 Dstop0[21]: D SDI internal STB port requested Dstop
19 Dstop0[26]: D 1E AXQ requests Slot0 Dstop (M)
20 SDI EX08/S0 Recordstop0[31:0] = 00818080
21 Rstop0[16]: R DARB texp request Recordstop (M)
22 Rstop0[23]: R 1E AXQ requests all Recordstop (M)
23 AXQ EX08 ( 8) Error_Flag_07[31:0] = 020B8200 Mask = 63FF7D24
24 Err7[16]: R CDC0 correctable error
25 Err7[17]: R CDC0 address parity error
26 Err7[19]: R CDC1 correctable error
27 Err7[25]: R 1E CDC uncorrectable error
28 AXQ EX08 ( 8) Error_Flag_08[31:0] = 20002000 Mask = 0000FFFF
29 Err8[29]: D CDC indicates an owner outside the domain
30 FAIL CDC Dimm EX8: Dstop/Rstop detected by AXQ.
31 Primary service FRU is EXB EX8.
32 SDI EX13/S0: All SDI is DStopped and RStopped, requested by DARB.
33 SDI EX16/S0: All SDI is DStopped and RStopped, requested by DARB.
34 DARB C0: enabled ports (expanders) [17:0]: 16100
35 DARB C0: exps request Dstop+Rstop [17:0]: 00100
36 DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00100
37 DARB C1: enabled ports (expanders) [17:0]: 16100
38 DARB C1: exps request Dstop+Rstop [17:0]: 00100
39 DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00100
SOLUTION SUMMARY:
- Troubleshooting:
The dump header tells us that this Dstop was generated by dsmd (lines 10,11)
while a domain was active. This is also evident from the dumpf file name:
dsmd.dstop files are created by dsmd as part of an ASR. Walking the
error chain:
- Master SDI on EX8 is directed to Dstop by AXQ8 (line 19)
- Master SDI on EX8 is directed to Rstop by AXQ8 (line 22)
- AXQ8 reports several CDC related errors, all indicating Rstop (lines 24-27)
- AXQ8 reports a fatal error in the CDC (line 29)
- The CDC DIMM is FAILed from the configuration (line 30)
- EX8 is named as the FRU (line 31)
The CDC DIMM is divided into 3 SRAMs, read in parallel, forming a 3-way
set associative cache. CDC entries contain information about lines of
memory recently referenced by SSM logic.
Any error (correctable or uncorrectable) in the CDC is recorded and
logged, but never causes a Dstop. Entries with correctable errors are
written back with the corrected data. Uncorrectable errors are treated
as cache misses. Notice that the errors recorded in AXQ8's Err7
register (lines 24-27) are all Recordstop events ('R' precedes the
error description).
However, from the name of the dump file (line 01) and the dsmd action
(line 11), we know this is a Dstop. The Dstop is triggered because the
data in the CDC indicates the owner of a cache line is a board that is
not in the resources comprising this domain. Either AXQ8 wrote the
offending entry, or the CDC entry has been trashed. In either scenario,
this fault is deemed serious (coherency within the domain is
in question), thus the Dstop. So, although the first error is a
Recordstop (line 27), because another error requiring Dstop occurs,
the stop acted upon is a Dstop.
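The precedence rule described above can be sketched in a few lines of
Python. This is an illustration only and is not part of redx: the helper
names and the regular expression are hypothetical, and the sketch assumes
that every decoded wfail line carries an 'R' (Recordstop) or 'D' (Dstop)
flag ahead of the description, as lines 24-29 do.

```python
import re

# Hypothetical helper, not part of redx. Matches decoded wfail lines such as
#   "Err7[25]: R 1E CDC uncorrectable error"
LINE_RE = re.compile(
    r"^(?P<reg>\w+)\[(?P<bit>\d+)\]:\s+(?P<kind>[RD])\s+(?P<text>.*)$"
)

def classify(lines):
    """Return (kind, register, bit, description) for each decoded error line."""
    out = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            out.append((m["kind"], m["reg"], int(m["bit"]), m["text"]))
    return out

def stop_acted_upon(lines):
    """A Dstop wins if any error requests one; otherwise it is a Recordstop."""
    kinds = {kind for kind, _, _, _ in classify(lines)}
    if "D" in kinds:
        return "Dstop"
    if "R" in kinds:
        return "Recordstop"
    return None

# AXQ8 errors copied from lines 24-29 of the wfail output.
axq8 = [
    "Err7[16]: R CDC0 correctable error",
    "Err7[17]: R CDC0 address parity error",
    "Err7[19]: R CDC1 correctable error",
    "Err7[25]: R 1E CDC uncorrectable error",
    "Err8[29]: D CDC indicates an owner outside the domain",
]
print(stop_acted_upon(axq8))  # Dstop
```

With only the four Err7 lines as input, the same function reports a
Recordstop, which matches the statement that CDC ECC and parity errors
alone never Dstop the domain.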
In this case, because of the sheer number of CDC-related errors, it is
clear that the CDC is in dire straits. That's why the CDC DIMM is FAILed
from the configuration (line 30). The CDC is not a FRU, so the expander
must be replaced (line 31).
Also note the blacklisting suggestion made by wfail:
40 redxl> wfail -B
41 membrd SB8 # redx wfail of dump 020510.0947.10
By not using memory on SB8, there is no home memory within EX8. Thus,
the CDC DIMM on EX8 is not used.
- Resolution:
Repair/replace EX8.
- Summary of part number and patch IDs
http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html
- References and bug IDs
SunSolve Article 48122
SDI ASIC Specification
Starcat Architecture, 11/07/2000
- Additional background information:
The details of the CDC DIMM entries in error are available in the AXQ data
capture. First, understand the format of a CDC entry:
SHARED ENTRY
============
|3| 2| 1| |
|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|
+-+-----------------------------------+-----------------------+
|1| Bitmask of sharers | Tag |
+-+-----------------------------------+-----------------------+
OWNED ENTRY
============
|3| 2| 1| |
|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|
+-+---------------------+-+-+---------+-----------------------+
|0| Unused |V|R| Owner | Tag |
+-+---------------------+-+-+---------+-----------------------+
V = Valid Entry (1 = valid, 0 = invalid)
R = Retention priority
Bit [30] indicates if the line is Owned or Shared. In Owned entries,
bits [16:12] indicate the boardset that owns the line. The owner field
is only valid if bit [18] is set. In Shared entries, bits [29:12]
indicate which boardsets contain a copy of the cache line. The bit
indicating a shared entry (bit [30]) implies a valid entry.
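As a rough illustration of the field layout just described, the following
Python sketch decodes one CDC entry. The function names are hypothetical
and are not redx commands. The sketch also assumes that the entry occupies
bits [30:0] of the low 32 bits of the captured SRAM word and that any
higher bits can be ignored; that assumption is not stated in this article,
although it reproduces the Mask and Tag values that redx prints for the
CDC0 capture on lines 55 and 56.

```python
def decode_cdc_entry(word):
    """Decode bits [30:0] of a CDC entry according to the layout above."""
    tag = word & 0xFFF                          # Tag, bits [11:0]
    if (word >> 30) & 1:                        # bit [30] set: Shared entry
        return {
            "type": "Shared",
            "valid": True,                      # Shared implies valid
            "sharers": (word >> 12) & 0x3FFFF,  # sharer bitmask, bits [29:12]
            "tag": tag,
        }
    return {
        "type": "Owned",
        "valid": bool((word >> 18) & 1),        # V, bit [18]
        "retention": (word >> 17) & 1,          # R, bit [17]
        "owner": (word >> 12) & 0x1F,           # owning boardset, bits [16:12]
        "tag": tag,
    }

def boardsets(mask):
    """List the boardset numbers (0-17) whose bits are set in a sharer mask."""
    return [n for n in range(18) if (mask >> n) & 1]

# Low 32 bits of "CDC 0 sram data[35:0] = E.D0000C9E" (line 55).
entry = decode_cdc_entry(0xD0000C9E)
print(entry["type"], hex(entry["sharers"]), hex(entry["tag"]))
print(boardsets(entry["sharers"]))
```

For an Owned entry the owner field should be trusted only when the
returned "valid" flag is true, mirroring the bit [18] rule above.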
A 3-wide CDC entry spanning the 3 CDC SRAMs is further protected by
8 bit ECC. A 3-wide entry also uses 3 bits of LRM (Least Recently
Modified) to help in selection of an entry during victimization.
Examine this dump example:
42 redxl> shaxq -e 8
43 Note: Data is displayed from the currently loaded dump file.
44 AXQ EX8 (8) Component ID = C4312049 Rev 6.0
45 Error_Flag_00[31:0] = 00000000 Mask = 0000FFFF
46 Error_Flag_01[31:0] = 00000000 Mask = 4000FFFF
47 Error_Flag_02[31:0] = 00000000 Mask = 0000FFFF
48 Error_Flag_03[31:0] = 00000000 Mask = 21005EFF
49 Error_Flag_04[31:0] = 00000000 Mask = 01FEFFFF
50 Error_Flag_05[31:0] = 00000000 Mask = 1024FFFF
51 Error_Flag_06[31:0] = 00000000 Mask = 7E00FFFF
52 Error_Flag_07[31:0] = 020B8200 Mask = 63FF7D24
53 Err7[16]: R CDC0 correctable error
54 CDC error count[3:0] = A Read Addr[18:0] = 19172 (GoodApar= 0)
55 CDC 0 sram data[35:0] = E.D0000C9E
56 CDC0 entry: Shared, Mask = 10000, Tag = C9E
57 CDC 1 sram data[35:0] = F.50000E1E
58 CDC1 entry: Shared, Mask = 10000, Tag = E1E
59 CDC 2 sram data[35:0] = A.50000D9E
60 CDC2 entry: Shared, Mask = 10000, Tag = D9E
61 ECC Syndrome[7:0] = 88: Uncorrectable Error
62 cdc_errsave1[19]: Capture is for Outside Domain Error
63 LRU[3:0] = A
64 cdc_errsave0[3:0][31:0] = 6D0000 C9EF5000 0E1EA500 00D9E880
65 cdc_errsave1[31:0] = 0A099172
66 Err7[17]: R CDC0 address parity error
67 CDC error save data is displayed above.
68 Err7[19]: R CDC1 correctable error
69 CDC error save data is displayed above.
70 Err7[25]: R 1E CDC uncorrectable error
71 CDC error save data is displayed above.
72 Error_Flag_08[31:0] = 20002000 Mask = 0000FFFF
73 Err8[29]: D CDC indicates an owner outside the domain
74 CDC error save data is displayed above.
75 Error_Flag_09[31:0] = 00000000 Mask = 7E00FFFF
76 Error_Flag_10[31:0] = 00000000 Mask = 7C00FFFF
77 Error_Flag_11[31:0] = 00000000 Mask = 7FF0FFFF
Let's focus on the CDC0 entry (lines 55,56). The CDC0 entry is decoded by
redx. We see the line is shared and the Mask is 10000, so SB16 is the
only sharer for this cache line. According to the dump header, SB16 is in
the domain (line 08). The data capture therefore gives no indication of an
owner outside the domain resources. Since the first error was an uncorrectable
ECC error, the data in the capture is likely from that event. Subsequent
CDC errors are not captured until after the dump is collected and the
ASICs are rearmed.
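The membership check made in the paragraph above can be expressed as a
short Python sketch. The names are hypothetical, and treating the
Slot0[17:0] value from the dump header (line 08) as the set of boards in
the domain is an assumption made for illustration; it is consistent with
the statement that SB16 is in the domain.

```python
DOMAIN_SLOT0 = 0x12100   # Slot0[17:0] from the dump header (line 08)

def boards_in(mask, width=18):
    """List the board numbers whose bits are set in a [17:0] mask."""
    return [n for n in range(width) if (mask >> n) & 1]

def outside_domain(entry_mask, domain_mask):
    """Return the bits of entry_mask that name boards not in the domain."""
    return entry_mask & ~domain_mask

sharers = 0x10000        # Mask reported for the CDC0 entry (line 56)
print(boards_in(DOMAIN_SLOT0))                            # [8, 13, 16]
print(boards_in(outside_domain(sharers, DOMAIN_SLOT0)))   # []

# Hypothetical Owned entry naming boardset 5, which is not in this domain.
owner_mask = 1 << 5
print(boards_in(outside_domain(owner_mask, DOMAIN_SLOT0)))  # [5]
```

An empty result for the captured entry agrees with the observation that
the capture itself shows nothing outside the domain.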
Note that this fault was injected by grounding part of the pathway between
the CDC SRAM and AXQ. Fault injection aside, if a "CDC indicates an owner
outside the domain" error occurs, it implies one of two things:
o The AXQ is writing faulty ownership/sharer information to the CDC entries
o Multiple bit flips occurred in a CDC entry
In any case, the expander is the FRU.
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords
15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K,
starcat, dstop, CDC indicates an owner outside the domain
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: