| SRDB ID | Synopsis | Date |
| 48122 | Sun Fire[TM] 12K/15K: An Overview of Dstop Diagnosis | 29 Oct 2002 |
| Status | Issued |
| Description |
- Problem Statement:
Dstop Diagnosis - An Overview
- Symptoms:
A domain suffers a Dstop. Messages in the platform log are similar
to the following:
Apr 25 08:07:43 2002 sc0 hwad[282]: [1156 3601896159843595 ERR
InterruptHandler.cc 2159] Domain Stop interrupt detected, domain B
Alternatively, during POST execution, the following is noted in the
POST log:
DSTOP Detected for Slot SB13
SDI EX13/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
System state dumped to /var/opt/SUNWSMS/SMS1.2/adm/F/dump/xcstate.020805.0427.40
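As a minimal sketch of how the platform log symptom above could be spotted automatically, the snippet below searches log text for the Dstop message and returns the affected domain letters. The function name `dstopped_domains` is hypothetical, and the only message format assumed is the single sample line quoted above.

```python
import re

# Pattern taken from the sample platform log message in this article.
DSTOP_RE = re.compile(r"Domain Stop interrupt detected, domain (\w+)")

def dstopped_domains(log_text):
    """Return the domain identifiers named in Dstop interrupt messages."""
    # Collapse whitespace first, since a long message may be wrapped
    # across several lines when it is pasted into a case note.
    flat = " ".join(log_text.split())
    return DSTOP_RE.findall(flat)

sample = """Apr 25 08:07:43 2002 sc0 hwad[282]: [1156 3601896159843595 ERR
InterruptHandler.cc 2159] Domain Stop interrupt detected, domain B"""
print(dstopped_domains(sample))
```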
SOLUTION SUMMARY:
- Troubleshooting:
A Dstop occurs when the system hardware encounters a fatal error. The
source/cause of the stop can and does vary. There is no single answer for
how to diagnose a Dstop. The intention of this article is to provide a
framework on which to base a Dstop analysis. Specifics on interpreting
'redx' output are left to other articles dealing with specific Dstops,
although a brief overview of the objectives of 'wfail' is given in "Background
Information" below.
A general approach to Dstop analysis is:
1. Examine the hardware state dump.
This is the obvious first step. Load the state dump into 'redx' and execute 'wfail'.
Examine the output from the standpoint of how errors are recorded and
reported within the Starcat hardware. Refer to "Background Information"
below for further details on error recording/reporting.
In many cases, the cause of the Dstop is obvious. 'wfail' identifies
either a single suspect component, or at most two suspects separated
by a single interconnect (e.g., a System Board and Expander). However, if
the cause of the Dstop is not obvious (e.g., multiple, disparate
suspect components or an undiagnosable timeout), the state dump alone may
not be sufficient to identify the source of the Dstop.
2. Characterize the platform at the time of the Dstop.
When the source/cause of a Dstop is not obvious from the state dump,
it is wise to characterize the platform activity just prior to the
Dstop. If multiple Dstops are present in a short time span, focus
initially on the first Dstop.
Sources of information include platform message logs, domain message
logs, domain console logs, and POST logs. Items of interest include
environmental fluctuation (temperature, voltage, etc.), actions on
shared components (expanders, centerplanes), or evidence of user
actions that could impact the Dstopped domain(s).
3. Consider recent changes/services to the system and the failure history.
When neither the state dump nor the platform characterization yields
strong suspect component(s), another avenue of investigation is
recent changes to the platform/domain. Changes include hardware
replacements, upgrades, and additions. Software changes are also
notable, such as updated patches to SMS or Solaris[TM] on the domain.
An examination of the failure history of a platform/domain may also
reveal a trend or pattern that may suggest a suspect component.
Such investigation may be the only recourse when faced with an
undiagnosable timeout.
From the above steps, the goal is to identify one or more suspect
components. Once identified, there are two goals: problem resolution and
problem avoidance.
Problem Resolution:
Typically, but not exclusively, a Dstop translates into a component
replacement. Once the suspect component is removed from the system
and its replacement installed and tested, resolution is achieved. In
'redx', the suspect component(s) to replace are indicated by the
"Primary/Secondary FRU" lines in 'wfail' output. For example:
Primary service FRU is EXB EX8.
Secondary service FRU is CSB C1 or the logic centerplane.
Or, in the case of iterative debugging sessions, other observed
behaviors in the system (POST failures, panics, etc.) may be the
indicator of a suspect component.
However, customer and/or system constraints may delay the scheduling
of a maintenance window for component replacement. Hence, problem
avoidance becomes key.
Problem Avoidance:
As mentioned above, resolution may be days/weeks/months into the
future. However, it may be possible to avoid the source of the
problem in the interim. In 'redx', components listed after "FAIL" in
'wfail' output indicate those components to remove from the
configuration to avoid the problem. For example:
FAIL Port SB14/P0: Dstop detected by SDC
The same style of text is also present in POST logs when a Dstop is
detected during POST. Components reported as FAILed do NOT necessarily
equate to the FRU for problem resolution. In this example, a processor
is FAILed, but a processor is not a FRU in the Starcat - the system
board is the FRU.
The problem can be avoided by blacklisting the component(s) listed
as FAILed. Blacklisting can also be a useful method for verifying a
diagnosis prior to replacing any hardware. In cases where there are
multiple suspect components, one component can be blacklisted. If the Dstop
does not return, confidence is raised that the blacklisted component
is the problem source. And, of course, blacklisting provides an
excellent interim solution to maximize domain uptime.
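The distinction between components to avoid (FAIL lines) and components to replace (service FRU lines) can be sketched as a small text filter. This is an illustration only: the function name `summarize_wfail` and both regular expressions are assumptions based solely on the three sample lines quoted in this article, which come from two separate examples rather than one real 'wfail' run. Actual 'wfail' output may use other line formats.

```python
import re

# Both patterns are assumptions derived from the sample lines in this article.
FAIL_RE = re.compile(r"^\s*FAIL\s+(?P<component>[^:]+):\s*(?P<reason>.*)$")
FRU_RE = re.compile(
    r"^\s*(?P<rank>Primary|Secondary) service FRU is (?P<fru>.+?)\.?\s*$"
)

def summarize_wfail(text):
    """Split wfail-style text into (components to avoid, FRUs to replace)."""
    avoid, replace = [], []
    for line in text.splitlines():
        m = FAIL_RE.match(line)
        if m:
            # Candidate for blacklisting (problem avoidance).
            avoid.append((m["component"].strip(), m["reason"]))
            continue
        m = FRU_RE.match(line)
        if m:
            # Candidate for replacement (problem resolution).
            replace.append((m["rank"], m["fru"]))
    return avoid, replace

sample = """Primary service FRU is EXB EX8.
Secondary service FRU is CSB C1 or the logic centerplane.
FAIL Port SB14/P0: Dstop detected by SDC"""
avoid, replace = summarize_wfail(sample)
print(avoid)
print(replace)
```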
- Resolution:
Specific resolution varies from case to case. Typically, resolution to
a Dstop is the replacement of a failed component. Another possible resolution
is the application of software patches to SMS/Solaris.
- Summary of part number and patch IDs
Various.
- References and bug IDs
Other knowledge articles discuss specific Dstops in greater detail.
http://cpre-amer.west.sun.com/esg/hsg/starcat/xctt/redx_dumpanalysis.html
http://esp.west.sun.com/starcat/post/dstop_101.html
- Additional background information:
'wfail' Objectives:
'redx' provides the 'wfail' command (pronounced "w'fail"), as in
"what failed". It is the first line of attack for analyzing a
hardware state dump. 'wfail' performs a scan of all the DARBs and
master SDIs in the dump and reports any errors captured in those
ASICs. If the captured data indicates errors present in other ASICs,
'wfail' follows that chain and displays the errors in those ASICs
as well. Finally, 'wfail' reports what component to FAIL from the
configuration and the service FRU(s), if any.
Understanding the objectives of 'wfail' is crucial to knowing what the
command is useful for, and what its limitations are. 'wfail' has
three objectives:
1. Report the errors detected in the hardware.
'wfail' generally tries to report which errors occurred first,
as these are most interesting for diagnosis.
2. Report what resource(s) should be deconfigured (FAILed) from
the configuration to leave the maximal fault-free configuration.
The semantic 'wfail' uses is precisely equivalent to what POST
would choose to do if the error(s) occurred during POST. The
FAILed component is NOT necessarily the broken component.
Understanding this point is key. But, by preventing the domain
from using the FAILed component, the error can be avoided.
For example, suppose the centerplane has a problem communicating
with Expander 3. 'wfail' will not FAIL the centerplane, because doing
so would impact the entire platform, potentially making the platform
unusable. Rather, the expander is FAILed and the remainder of
the system is usable. Even if by some means 'wfail' knew the
fault was on the centerplane (which it can't), the expander would
still be FAILed. Overall, this has less impact on the platform.
Thus, the maximal fault-free configuration is provided.
3. Recommend which FRU(s) to replace.
The key word is recommend. If the fault is across an interconnect,
such as the centerplane/expander example above, isolating to a
single FRU is not possible. Such is the nature of an interconnect
architecture. However, 'wfail' will call out a primary FRU and,
when applicable, a secondary FRU in such cases. The troubleshooter
must make a judgment on which component to target for replacement.
Error Recording:
The error registers in the 12K/15K ASICs are each 32 bits in size and
can record up to 15 different errors. The registers are organized to
distinguish between first errors and accumulated errors. This
organization applies to the SDIs, AXQs, and most of the L1 board
ASICs, although the handling of the accumulated error bits differs
slightly. A typical error register is as follows:
 31                          16 15 14                       0
+------------------------------------------------------------+
|  |                           |  |                          |
|  | Accum Error Flags [30:16] |1E| First Error Flags [14:0] |
|  |                           |  |                          |
+------------------------------------------------------------+
1E = 1st Error
The first error bits [14:0] are only set if no other errors have
already been recorded in the register. Bit [15] is set only if a
bit [14:0] is being set and no other error registers in the ASIC
already have bit [15] set. Thus, an ASIC can accurately report the
first error it encountered. Note that the first error is with respect
to the ASIC. It does not necessarily indicate the first error within
the domain.
The accumulated error bits [30:16] have a 1-to-1 relationship to
the first errors [14:0]. For the SDI and AXQ, the accumulated errors
are always set when the corresponding first error bit is set. Consider
the following example:
SDI EX14/S0 Dstop0[31:0] = 10019000
Dstop0[16]: D DARB texp requests all Dstop (M)
Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
The register's bits are 10019000. This equates to bits 12, 15, 16 and 28
being set. The first error is bit 12. It is the only bit set in the [14:0]
range. When bit 12 is set, bit 15 is also set so the ASIC records
subsequent errors in the accumulated area. Also, when bit 12 is set, its
corresponding accumulated bit 28 is set. Sometime after the first error
event, another error is recorded and bit 16 is set. It does not have a
corresponding first error because the first error has already occurred.
'redx' decodes the accumulated error bits [30:16] in the register. This
is intentional - we want to see all of the errors that were recorded in
the ASIC. To distinguish which error occurred first, first errors are
flagged with a '1E'.
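The bit arithmetic in the SDI example above can be checked with a short sketch. The function name `decode_error_register` and the layout of the returned dictionary are illustrative assumptions and are not part of 'redx'. Only the field boundaries ([14:0], bit 15, [30:16]) come from the register description above.

```python
def decode_error_register(value):
    """Split a 32-bit error register value into its documented fields."""
    set_bits = [b for b in range(32) if (value >> b) & 1]
    first = [b for b in set_bits if b <= 14]        # first error flags [14:0]
    accum = [b for b in set_bits if 16 <= b <= 30]  # accumulated flags [30:16]
    return {
        "set_bits": set_bits,
        "first_error_recorded": bool((value >> 15) & 1),  # bit 15 (1E)
        "first_errors": first,
        "accumulated": accum,
        # Accumulated bits whose matching first error bit (n - 16) is set.
        # For the SDI and AXQ these are the lines redx flags with '1E'.
        "accumulated_first": [b for b in accum if (b - 16) in first],
    }

reg = decode_error_register(0x10019000)
print(reg["set_bits"])           # [12, 15, 16, 28]
print(reg["first_errors"])       # [12]
print(reg["accumulated_first"])  # [28]
```

Bit 28 is the accumulated counterpart of first error bit 12, which is why the Dstop0[28] line carries the '1E' flag while Dstop0[16] does not.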
For L1 board ASICs, accumulated error bits are treated slightly differently.
Unlike the AXQ and SDI, accumulated error bits are not set until a
repeated error of the same type occurs. Consider this example:
SDC SB14 PortErr [0][25:0] = 0028001 (Safari Port 0)
P0Err[ 0]: 1E Parity Bidi error
P0Err[ 17]: Parity Single error
The register's bits are 0028001. This equates to bits 0, 15 and 17 being
set. The first error is bit 0. It is the only bit set in the [14:0] range.
When bit 0 is set, bit 15 is also set so the ASIC records subsequent errors
in the accumulated area. Unlike the AXQ and SDI, bit 16 (bit 0's
corresponding accumulated bit) is not set. Sometime after the first error
event, another error is recorded and bit 17 is set.
As before, 'redx' denotes the first error with a '1E'. If repeated errors
of the same type occur, the notation becomes '1E+'.
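The L1 board notation can be sketched in the same way. Decoding 0028001 gives bits 0, 15 and 17, which matches the P0Err[0] and P0Err[17] lines in the example. The function name `annotate_l1_register` is hypothetical, and the rule that '1E+' corresponds to a first error bit whose accumulated counterpart (bit n + 16) is also set is a reading of the description above, not a statement of how 'redx' is implemented.

```python
def annotate_l1_register(value):
    """Return (bit, tag) pairs in the style described for L1 board ASICs."""
    notes = []
    for n in range(15):                        # first error flags [14:0]
        first = (value >> n) & 1
        accum = (value >> (n + 16)) & 1        # matching accumulated flag
        if first:
            # Assumption: a set accumulated bit means the same error repeated.
            notes.append((n, "1E+" if accum else "1E"))
        elif accum:
            notes.append((n + 16, ""))         # later error, no 1E tag
    return notes

print(annotate_l1_register(0x0028001))  # [(0, '1E'), (17, '')]
```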
For diagnosis purposes, first errors are of most interest. Note that it
is possible for an ASIC to have multiple first errors set in its error
registers, which indicates they were all set on the same clock cycle.
Error Reporting:
Error reporting in the 12K/15K is a hardware tree. There is an error
concentrator at each level of the interconnect, with the root of the tree
at the SC. Error reporting happens in three "waves". The first wave is to
centralize the stop request in the centerplane, specifically the DARBs.
The second wave is to notify the remainder of the domain resources that
a stop request is in process. And finally the third wave informs the System
Controller a stop condition exists in the hardware.
First wave:
The ASIC detecting an error reports the error to its nearest error
concentrator. The error concentrators are the EPLDs for L1 boards, the
Master SDI for the expanders, and the DARBs for the centerplane.
o Any errors detected within an L1 board are reported to the board's EPLD.
If the error is an ECC error, the EPLD asserts ECC_ERROR to its expander's
Master SDI. For all other error types, the EPLD asserts ERROR to its Master
SDI. The Master SDI in turn notifies the DARBs via the texp bus.
o Any errors detected within an expander are reported to the expander's
Master SDI. The Master SDI in turn notifies the DARBs via the texp bus.
o Any errors detected within a centerplane half are reported to the DARB
servicing that centerplane half via the Xstop bus. The DARBs inform each
other of any stop requests via the notify (Ntfy) wires connecting the
DARBs.
Second wave:
Once the detected error(s) have "bubbled" up to the DARBs, it is the DARB
that declares Dstop and/or Rstop to the remainder of the platform resources.
The DARBs notify the AMXs/RMXs/DMXs via the Xstop busses and all Master
SDIs via the texp bus. The stop demand message defines the type of stop
and the expander/slot(s) in error. The ASIC receiving the demand message
is responsible for stopping its appropriate ports/slots.
o The AMXs/RMXs/DMXs examine the stop message to determine the port (expander)
in error. If that expander is not a split expander, all other
non-split expander ports with which the errored port can communicate
are stopped. If the port (expander) in error is a split expander, only
the errored port is stopped.
The centerplane ASICs cannot blindly stop a port that routes to a split
expander. Such action could inappropriately halt a domain that is not
in error, breaching domain isolation. In such cases, stopping the expander
is deferred to the Master SDI on that expander.
o Each Master SDI examines the stop message to determine if it services any
L1 boards with which the port (expander) in error can communicate. The
Master SDI uses its configuration registers that define domain membership
to make this determination. If the Master SDI determines one or both of
its L1 boards must be stopped, it asserts ERR_PAUSE to that L1 board's AR.
The Master SDI will also stop itself, the slave SDIs, and the AXQ. In the
case of a split expander, the appropriate halves of the SDIs and AXQs are
stopped.
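The centerplane rule for split and non-split expanders described above can be modeled as a small set computation. This is a sketch under stated assumptions: the function name `centerplane_ports_to_stop`, the `can_reach` mapping, and the expander numbers are all hypothetical, and the model assumes the errored port itself is always stopped. It does not cover the Master SDI side, which relies on domain membership registers.

```python
def centerplane_ports_to_stop(errored, split_expanders, can_reach):
    """Ports (expanders) a centerplane ASIC would stop, per the rule above.

    errored         -- expander number named in the stop message
    split_expanders -- set of expander numbers configured as split
    can_reach       -- dict mapping an expander to the expanders it can
                       communicate with (hypothetical representation)
    """
    if errored in split_expanders:
        # Split expander: only the errored port is stopped. Stopping the
        # rest is deferred to the Master SDI on that expander.
        return {errored}
    # Non-split expander: stop the errored port (assumed) plus every
    # non-split port it can communicate with. Split ports are skipped
    # to preserve domain isolation.
    peers = {p for p in can_reach.get(errored, ()) if p not in split_expanders}
    return {errored} | peers

reach = {3: {4, 5, 7}, 5: {3}}
print(sorted(centerplane_ports_to_stop(3, {5}, reach)))  # [3, 4, 7]
print(sorted(centerplane_ports_to_stop(5, {5}, reach)))  # [5]
```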
Third wave:
When all domain resources have processed the stop message, the DARBs raise an
interrupt to the System Controller to signal that the hardware requests service.
hwad in SMS services the interrupt, examines the DARBs to determine the stop
type and also examines the SDIs to determine which domain(s) are impacted. The
hardware state dump is taken and SMS proceeds to recover the domain(s).
The diagram below details the error reporting flow, busses, etc.
                     CENTERPLANE                           EXPANDER                    SLOT 0
     ############################################   #######################   ########################
     #                                          #   #                     #   #                      #
     #  +------+  +------+  +------+  +------+  #   #       +-----+       #   #                      #
     #  | AMX0 |  | RMX0 |  | AMX1 |  | RMX1 |  #   #   +---| AXQ |<--+   #   #                      #
     #  |  x2  |  |      |  |  x2  |  |      |  #   #   |   +-----+   |   #   #                      #
     #  +--^---+  +--^---+  +---^--+  +---^--+  #   #  Stop           |   #   #                      #
     #     |         |          |         |     #   #   |   +-----+   |   #   #                      #
     #     |         |          |         |     #   #   |+--| SDI |<-+|   #   #                      #
     #     |         |          |         |     #   #   ||  +-----+  ||   #   #                      #
     #     |         |          |         |     #   # Requests     Stop   #   #                      #
     # Xstop bus     |      Xstop bus     |     #   #   ||  +-----+  ||   #   #  +------+  +------+  #
     #     |         |          |         |     #   #   ||+-| SDI |<-+|   #   #  |  AR  |  | EPLD |  #
     #     |     Xstop bus      |     Xstop bus #   #   ||| +-----+  ||   #   #  +--^---+  +-^--^-+  #
     #     |         |          |         |     #   #   |||        Demands#   #     |        |  |    #
     #     |         |          |         |     #   #   ||| +-----+  ||   #   ######|########|##|#####
     #     |         |          |         |     #   #   ||+>|     |--+|   #         |        |  |
     #   +-v---------v-+        |         |     #   #   |+->|     |---+   #         |        |  |
     #   |             |        |         |     #   #   +-->|     |-----ERR_PAUSE---+        |  |
     #   |             |        |         |     #t  #       |     |<----ECC_ERROR_S0---------+  |
     #   |    DARB0    |<-------|---------|------e-b------->|     |<----ERROR_S0----------------+
     #   |             |        |         |     #x u#       | SDI |       #
<--------|             |      +-v---------v-+ /--p-s------->| (M) |       #
Intr #   |             |<---->|             | | #   #       |     |<----ERROR_S1----------------+
to   #   +------^------+ Ntfy |             |<+ #   #   +-->|     |<----ECC_ERROR_S1---------+  |
SCs  #          |             |    DARB1    |   #   #   |+->|     |-----ERR_PAUSE---+        |  |
<---------------|-------------|             |   #   #   ||+>|     |--+    #         |        |  |
     #          |             |             |   #   #   ||| +-----+  |    #         |        |  |
     #          |             |             |   #   #  Stop          |    #   ######|########|##|#####
     #          |             +-------^-----+   #   #   ||| +-----+  |    #   #     |        |  |    #
     #          |                     |         #   #   ||+-| SDI |<-+    #   #  +--v---+  +------+  #
     #      Xstop bus                 |         #   #   ||  +-----+  |    #   #  |  AR  |  | EPLD |  #
     #          |                     |         #   # Requests     Stop   #   #  +------+  +------+  #
     #          |                 Xstop bus     #   #   ||  +-----+  |    #   #                      #
     #          |                     |         #   #   |+--| SDI |<-+    #   #                      #
     #          |                     |         #   #   |   +-----+  |    #   #                      #
     #      +---v--+              +---v--+      #   #   |          Demands#   #                      #
     #      | DMX0 |              | DMX1 |      #   #   |   +-----+  |    #   #                      #
     #      |  x6  |              |  x6  |      #   #   +---| SDI |<-+    #   #                      #
     #      +------+              +------+      #   #       +-----+       #   #                      #
     #                                          #   #                     #   #                      #
     ############################################   #######################   ########################
                                                                                       SLOT 1
As an example, take the case where the AXQ detects a parity error. Assume that
there are no split expanders and both Slot 0 and Slot 1 are part of the domain
in error.
1. The AXQ sends a stop request to its Master SDI.
2. The Master SDI reports this to the DARBs via the texp bus.
3. The DARBs broadcast the stop request to the AMXs/RMXs/DMXs and all
Master SDIs in the system.
4. The Master SDIs examine the stop message and, if appropriate, assert ERR_PAUSE
to the ARs of the L1 boards in the domain. In this example, ERR_PAUSE is
asserted to both Slot 0 and Slot 1.
5. The DARBs raise an interrupt to the System Controller.
- Meta-Data/Problem categorization:
Product/Platform: Sun Fire 12K/15K
Category:
- Keywords
dstop, error reporting, error recording, overview, primer, 15K, 12K, SF15K, SF12K, starcat
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: