| SRDB ID | Synopsis | Date |
| 48192 | Sun Fire[TM] 12K/15K: Dstop: Slot0 target slot transgression error | 31 Oct 2002 |
| Status | Issued |
| Description |
- Problem Statement:
Dstop: Slot0 target slot transgression error
- Symptoms:
'wfail' output reports something similar to the following:
01 redxl> dumpf load dsmd.dstop.020207.0007.29
02 Created Thu Feb 7 00:07:30 2002
03 By hpost v. 1.1 Generic 112099-05 Nov 27 2001 12:41:09 executing as pid=14740
04 On ssc name = xc46-sc0.SD_Lab.West.Sun.COM
05 Domain = 0=A Platform = sun15
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 00011
08 Slot0[17:0]: 00010
09 Slot1[17:0]: 00000 Requested/not enabled: 00001
10 'Not enabled' refers to the Console Bus master port on the parent board.
11 -D option, -d
12 "DSMD DomainStop Dump"
13 Created in a Sun Microsystems Inc. internal environment.
14 0 errors occurred while creating this dump.
15 redxl> wfail
16 SDI EX04/S0 Master_Stop_Status0[31:0] = D004000A
17 MStop0[3,1]: Slot 0 port is DStopped, SDI is Recordstopped.
18 SDI EX04/S0 Dstop0[31:0] = 00828080
19 Dstop0[17]: D DARB texp requests Slot0 Dstop (M)
20 Dstop0[23]: D 1E SDI internal Slot0 port requested Dstop
21 SDI EX04/S0 Slot0_Error1[31:0] = 00088008 Mask = 31444EBF
22 S0Err1[19]: D 1E Slot0 target slot transgression error (M)
23 {texp[4:0],targ_dev[2:0],s0dtarg,s0dstat[1:0],
24 s0dtransid[8:0]} = 04844
25 FAIL Slot SB4: Dstop/Rstop detected by SDI
26 Primary service FRU is Slot SB4.
27 Secondary service FRU is EXB EX4.
SOLUTION SUMMARY:
- Troubleshooting:
The dump header tells us that this Dstop was generated by dsmd (lines 11,12)
while a domain was active. This is also evident from the dumpf file name:
dsmd.dstop files are created by dsmd as part of an ASR.
The header also reports that hardware state for IO0 was not collected in the
dump (lines 09,10). As line 10 notes, the reason is that IO0's console bus
master was not enabled. The console bus master for IO0 is SDI4 on EX0. EX0
therefore warrants further investigation. Let's finish with wfail first. Walking
the error chain:
- The SDI on EX4 calls for Dstop with an internal error with respect to
its Slot 0 port (lines 18-20).
- The SDI register flagged a target slot transgression error (lines 21-24).
- wfail calls out SB4 as what FAILed and also the primary FRU (lines 25,26).
EX4 is marked as the secondary FRU (line 27).
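The bit numbers wfail prints can be cross-checked by hand against the raw
register values. A minimal Python sketch (the helper name set_bits is
illustrative and is not part of redx) lists the bits set in each register
quoted in the wfail output:

```python
def set_bits(value):
    """Return the positions of all 1 bits in value, lowest bit first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

# Raw values copied from the wfail output (lines 16, 18 and 21).
registers = {
    "Master_Stop_Status0": 0xD004000A,
    "Dstop0": 0x00828080,
    "Slot0_Error1": 0x00088008,
}

for name, value in registers.items():
    print(f"{name} = {value:08X} -> bits {set_bits(value)}")
```

Bits 1 and 3 of Master_Stop_Status0, bits 17 and 23 of Dstop0, and bit 19 of
Slot0_Error1 match what wfail decoded. The sketch does not attempt to interpret
the remaining set bits or the Mask value.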
A transgression is an attempt, successful or unsuccessful, to communicate with
a board that is not participating in your domain. By the errors present, SB4
attempted such an operation. The master SDI maintains bit vectors for its slot
boards which outline the other boards each is permitted to talk with. Looking
at the master SDI on EX4:
28 redxl> shsdi 4
29 Note: Data is displayed from the currently loaded dump file.
30 SDI EX04/S0 Component ID = 64317049
31 Master_Reset_Config[31:0] = 04000000
32 Master_Stop_Config[31:0] = 41001997
33 Core_Config[21:0] = 0DB3E2
34 Sysreg_Config[23:0] = 200001
35 STB_Config[23:0] = 20010F
36 Bogon_Config[63:0] = 00000003 C03C0010
37 CP_Config[20:0] = 0F0F70
38 Slot0Config[1:0][31:0,30:0] = 20000000 3CA2A150
39 Slot1Config[1:0][31:0,31:0] = 2000E0F0 28A83880
40 Slot0_Domain_Mask[17:0]: Slot1 = 00000 Slot0 = 00010
41 Slot0_Expand_Mask[17:0]: Slot1 = 00000 Slot0 = 00010
42 Slot1_Domain_Mask[17:0]: Slot1 = 00010 Slot0 = 00001
43 Slot1_Expand_Mask[17:0]: Slot1 = 00010 Slot0 = 00001
Looking at the Slot 0 domain and expander masks (lines 40,41), we see that
this SDI's Slot 0 board (SB4) is permitted to communicate with Slot 0 on EX4 -
in other words, only itself. This is not much of a domain to say the least.
But it does explain why the transgression error was detected: as soon as SB4
attempted any data transmission outside itself, the SDI considered that
transmission invalid.
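The mask values can be decoded in a few lines of Python. The sketch below
assumes, consistent with the reading above, that bit n of an 18-bit mask stands
for expander n, and that Slot 0 boards are named SBn and Slot 1 boards IOn. The
function names are hypothetical.

```python
def boards_in_mask(mask, prefix):
    """List board names for each set bit of an 18-bit [17:0] mask."""
    return [f"{prefix}{n}" for n in range(18) if (mask >> n) & 1]

def permitted_targets(slot1_mask, slot0_mask):
    """Boards a slot may communicate with, per its domain mask (assumed layout)."""
    return boards_in_mask(slot0_mask, "SB") + boards_in_mask(slot1_mask, "IO")

# Slot0_Domain_Mask on EX4 (line 40): Slot1 = 00000, Slot0 = 00010
sb4_targets = permitted_targets(slot1_mask=0x00000, slot0_mask=0x00010)
print(sb4_targets)            # ['SB4']
print("IO0" in sb4_targets)   # False
```

With IO0 absent from the list, any transaction from SB4 toward IO0 falls
outside the mask, which is the condition reported as a transgression.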
The question now is why is EX4's master SDI programmed this way? It does not
reflect a valid domain configuration. But, based on boards requested in the
dump header (lines 8,9), we can infer that the valid domain contained SB4 and
IO0. We previously noted that EX0 warranted a deeper look. Let's do that now:
44 redxl> shsdi 0
45 Note: Data is displayed from the currently loaded dump file.
46 SDI EX00/S0 Component ID = 64317049
47 Master_Reset_Config[31:0] = 00000018
48 Master_Stop_Config[31:0] = 41000897
49 Core_Config[21:0] = 0DA3C2
50 Sysreg_Config[23:0] = 200001
51 STB_Config[23:0] = 20010F
52 Bogon_Config[63:0] = 00000003 C03C0010
53 CP_Config[20:0] = 0F0F70
54 Slot0Config[1:0][31:0,30:0] = 20000000 3CA2A150
55 Slot1Config[1:0][31:0,31:0] = 0000E000 00000000
56 Slot0_Domain_Mask[17:0]: Slot1 = 00010 Slot0 = 00001
57 Slot0_Expand_Mask[17:0]: Slot1 = 00010 Slot0 = 00001
58 Slot1_Domain_Mask[17:0]: Slot1 = 00000 Slot0 = 00000
59 Slot1_Expand_Mask[17:0]: Slot1 = 00000 Slot0 = 00000
60 Force_Error[1:0][31:0] = 0000E000 00000000
61 Csr2Conf[28:0] = 01000000
62 IBIST_Enbl[1][3:0],[0][29:0] = 0 00000000
63 Master_Stop_Status0[31:0] = F0000000
64 Master_Stop_Status1[31:0] = 7F7F0000
65 Dstop0[31:0] = 00000000
66 Dstop1[31:0] = 00000000
67 Recordstop0[31:0] = 00000000
68 Recordstop1[31:0] = 00000000
69 Core_Error0[31:0] = 00000000 Mask = 0051FFFF
70 Core_Error1[31:0] = 00000000 Mask = FFFFFFFF
71 Sysreg_Error[31:0] = 00000000 Mask = 780377FF
72 STB_Error[31:0] = 00000000 Mask = 7F00FFFF
73 CP_Error0[31:0] = 00000000 Mask = 580067FF
74 CP_Error1[31:0] = 00000000 Mask = 7FFCFFFF
75 Slot0_Error0[31:0] = 00000000 Mask = 7000FFFF
76 Slot0_Error1[31:0] = 00000000 Mask = 31444EBF
77 Slot0_Error2[31:0] = 00000000 Mask = 7FFCFFFF
78 Slot1_Error0[31:0] = 00000000 Mask = FFFFFFFF
79 Slot1_Error1[31:0] = 08000000 Mask = FFFFFFFF
80 Slot1_Error2[31:0] = 00000000 Mask = FFFFFFFF
No errors, not even a DARB request for Dstop. Also, the masks for Slot 1
(lines 58,59) do not permit any transmissions from IO0. IO0 is part of the
defined domain, as indicated by the dump header (line 9). Again, the question
is why is the SDI programmed in such a manner? The ASICs can only be programmed
from the SCs. It's time to look outside the dump file.
The dump header tells us the dump was taken at Feb 7, 00:07:30 2002 (line 2).
Looking in the platform message log, we see the following messages reported
around that time:
Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708520653855 ERR PciComm.cc 195]
Cannot access console bus since the board IO0 is OFF
Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708521469608 ERR IosramComm.cc 516]
Failed to read from offset 1e for key 53444344
Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708540886871 ERR PciComm.cc 195]
Cannot access console bus since the board IO0 is OFF
Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708541793044 ERR IosramComm.cc 516]
Failed to read from offset 1e for key 53444344
Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708543136069 ERR PciComm.cc 195]
Cannot access console bus since the board IO0 is OFF
Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708543957908 ERR IosramComm.cc 516]
Failed to read from offset c for key 53444344
Feb 7 00:07:35 2002 xc46-sc0 dsmd[448]: [2517 37708971377554 WARNING EventHandler.cc 155]
Domain stop has been detected in domain A.
IO0 is reported as being off. There are also errors reported for key 53444344;
this is an IOSRAM key (the IosramComm.cc source file name is a big clue here).
IOSRAM access is via console bus, and console bus requires power. Thus, we can
be highly confident that IO0 was powered off. Since there were no esmd messages
indicating a power off reason, it was likely done by a system administrator.
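Picking the relevant messages out of a large platform log can be scripted. The
following is a rough Python sketch, assuming the timestamp layout shown in the
excerpt above (for example "Feb 7 00:07:34 2002"). The helper name
messages_near is hypothetical, and the 60-second window is an arbitrary choice.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%b %d %H:%M:%S %Y"

def messages_near(lines, dump_time, window=timedelta(seconds=60)):
    """Return log lines stamped within window of dump_time.

    Lines without a leading timestamp are treated as continuations and
    follow the decision made for the preceding timestamped line.
    """
    hits = []
    keep = False
    for line in lines:
        try:
            stamp = datetime.strptime(" ".join(line.split()[:4]), STAMP_FORMAT)
            keep = abs(stamp - dump_time) <= window
        except ValueError:
            pass  # continuation line: inherit the previous decision
        if keep:
            hits.append(line)
    return hits

dump_time = datetime(2002, 2, 7, 0, 7, 30)  # from line 02 of the dump header
log = [
    "Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708520653855 ERR PciComm.cc 195]",
    "Cannot access console bus since the board IO0 is OFF",
    "Feb 7 00:07:35 2002 xc46-sc0 dsmd[448]: [2517 37708971377554 WARNING EventHandler.cc 155]",
    "Domain stop has been detected in domain A.",
]
for line in messages_near(log, dump_time):
    print(line)
```

Messages stamped well before or after the dump are dropped along with their
continuation lines, which leaves only the events that bracket the Dstop.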
It's likely that once we saw the Requested/not enabled report in the dump header
(line 09) we would have begun to look outside the dump file. But for the
purposes of this discussion, the walk through the hardware provided some insight
into why the transgression error was reported. Also, this is a prime example of
the limitations of wfail. wfail can only report and analyze the data made
available to it.
- Resolution:
Repair/replace IO0.
- Summary of part number and patch IDs
- References and bug IDs
SunSolve Article 48122
- Additional background information:
With the conclusion that IO0 was powered off, the programming of the SDIs
makes sense. At power off, the power libraries remove the powered off board(s)
from the masks in the SDIs (and AXQs) to ensure no transactions can be sent to,
or, more importantly, sourced from those board(s).
This in turn also explains why the master SDI on EX0 did not report any errors.
We know from the error reporting in the hardware that the DARBs broadcast a
stop request to all master SDIs in the system. The SDI examines the
request to determine if either of its Slot 0/1 boards needs to be stopped.
By the time the SDI on EX0 received the stop request, IO0 had already
been removed from the mask registers. Thus, the SDI determined it had no slots
that needed to participate in the stop.
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords
15K, 12K, SF15K, SF12K, starcat, dstop, Slot0 target slot transgression error
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: