| SRDB ID | Synopsis | Date |
|---|---|---|
| 47414 | Sun Fire[TM] 12K/15K: dstop; Timeout on command reissue transaction to Slot0 | 22 Oct 2002 |

Status: Issued
Description:
Under heavy I/O load, a Sun Fire[TM] 12K/15K domain can dstop (domain stop). The following is an example of the wfail output from redxl:
redxl> wfail
SDI EX00/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX01/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX01/S0 Recordstop0[31:0] = 00818001
Rstop0[16]: R 1E DARB texp request Recordstop (M)
Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX02/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX03/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX04/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX04/S0 Recordstop0[31:0] = 00818001
Rstop0[16]: R 1E DARB texp request Recordstop (M)
Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX04/S0 Core_Error0[31:0] = 02008200 Mask = 0051FFFF
CoreErr0[25]: D 1E Command pool timeout, non-split exp (M)
valid_{slot_wr[1:0],read}_TO = 1 (rev 4+)
{cmd_pool_loc[5:0],cmd4io,retired,half_used} = 020
SDI EX04/S0 STB_Error[31:0] = 00018001 Mask = 7F00FFFF
STBErr[16]: D 1E STB entry timeout
{loc[4:0],stb_full[1:0],retired,half_used,reord} = 03C
SDI EX05/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX05/S0 Recordstop0[31:0] = 00818001
Rstop0[16]: R 1E DARB texp request Recordstop (M)
Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX06/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX06/S0 Recordstop0[31:0] = 00818001
Rstop0[16]: R 1E DARB texp request Recordstop (M)
Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX07/S0 Master_Stop_Status0[31:0] = 4004004F
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX07/S0 Recordstop0[31:0] = 00818080
Rstop0[16]: R DARB texp request Recordstop (M)
Rstop0[23]: R 1E AXQ requests all Recordstop (M)
AXQ EX07 ( 7) Error_Flag_00[31:0] = 00048004 Mask = 00047FFB
Err0[18]: R 1E Timeout on command reissue transaction to Slot0
FAIL Slot SB7: Dstop/Rstop detected by AXQ.
The FRU for this failure cannot be identified from the available information.
This error is not diagnosable. The FAIL action is just a guess to
satisfy the POST design requirement that something must be
deconfigured after a stop to guarantee that the process terminates.
The FAILed component is no more suspect than any other hardware
in the domain.
SDI EX08/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX09/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX10/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX11/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX12/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX12/S0 Recordstop0[31:0] = 00818001
Rstop0[16]: R 1E DARB texp request Recordstop (M)
Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX13/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX14/S0: All SDI is DStopped and RStopped, requested by DARB.
DARB C0: enabled ports (expanders) [17:0]: 0FFFF
DARB C0: exps request Rstop [17:0]: 00080
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 08080
DARB C1: enabled ports (expanders) [17:0]: 0FFFF
DARB C1: exps request Rstop [17:0]: 00080
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 08080
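Each hex register value in the wfail output packs several conditions into individual bits, and redxl annotates the bits it considers significant (for example, Recordstop0 = 00818001 is followed by Rstop0[16] and Rstop0[23]). The following minimal Python sketch lists the set bits of such a value so a reader can match them against the annotations. It also maps the DARB [17:0] expander masks to expander names, on the assumption that bit n corresponds to expander EXnn. The helpers `set_bits` and `mask_to_expanders` are hypothetical and not part of redxl. The meaning of each bit still comes from redxl itself.

```python
def set_bits(hex_value: str) -> list[int]:
    """Return the positions of all set bits in a hex register dump, lowest first."""
    value = int(hex_value, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def mask_to_expanders(hex_mask: str, width: int = 18) -> list[str]:
    """Map a DARB expander mask ([17:0]) to names such as 'EX07'.

    Assumption (hypothetical helper): bit n of the mask corresponds to
    expander EXnn. Bits at or above `width` are ignored.
    """
    return [f"EX{bit:02d}" for bit in set_bits(hex_mask) if bit < width]


if __name__ == "__main__":
    # Register values copied from the wfail output above.
    print("Recordstop0   00818001 ->", set_bits("00818001"))
    print("Core_Error0   02008200 ->", set_bits("02008200"))
    print("Error_Flag_00 00048004 ->", set_bits("00048004"))
    print("DARB exps request Rstop 00080 ->", mask_to_expanders("00080"))
    print("DARB enabled ports      0FFFF ->", mask_to_expanders("0FFFF"))
```

For this dump, Recordstop0 = 00818001 has bits 0, 15, 16, and 23 set, which includes the annotated Rstop0[16] and Rstop0[23]. Error_Flag_00 = 00048004 has bit 18 set, which matches the Err0[18] reissue timeout line. Under the bit-to-expander assumption, the DARB Rstop request mask 00080 decodes to EX07, the expander whose AXQ reported the timeout.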
redxl> shaxq 7
Note: Data is displayed from the currently loaded dump file.
AXQ EX7 (7) Component ID = C4312049 Rev 6.0
ExpID[4:0] = 07
Config0[31:0] = 1B380CF9
Config1[31:0] = 00249BC0
Timeout_Conf 1[19:0] = 7BDEF 0[31:0] = 1EF7BE0F
Sec_Config[22:0] = 000000
Csr0_status[4:0] = 0F
ID_Mask[31:0] = 00000000 Home_Mask[31:0] = 00000000
Flow_Ctl_Config[28:0] = 00CF0888
Config6[31:0] = 00000000
Config4[31:0] = 09C00000
Slot0_Domain_Mask[17:0]: Slot1 = 0FFFF Slot0 = 0FFFF Where Slot SB7
Slot0_DomInt_Mask[17:0]: Slot1 = 00000 Slot0 = 0FFFF can send.
Slot1_Domain_Mask[17:0]: Slot1 = 00080 Slot0 = 0FFFF Where Slot IO7
Slot1_DomInt_Mask[17:0]: Slot1 = 00000 Slot0 = 0FFFF can send.
Error_Flag_00[31:0] = 00048004 Mask = 00047FFB
Err0[18]: R 1E Timeout on command reissue transaction to Slot0
Port[1:0] = 2 ATransID[3:0] that timed out = 9
reqagent_errsave0[2:0][31:0] = 0000 00000000 00520000
Error_Flag_01[31:0] = 00000000 Mask = 40047FFB
Error_Flag_02[31:0] = 00000000 Mask = 0000FFFF
Error_Flag_03[31:0] = 00000000 Mask = 21005EFF
Error_Flag_04[31:0] = 00000000 Mask = 01FEFFFF
Error_Flag_05[31:0] = 00000000 Mask = 1024FFFF
Error_Flag_06[31:0] = 00000000 Mask = 7E00FFFF
Error_Flag_07[31:0] = 00000000 Mask = 63FF7D24
Error_Flag_08[31:0] = 00000000 Mask = 0000FFFF
Error_Flag_09[31:0] = 00000000 Mask = 7E00FFFF
Error_Flag_10[31:0] = 00000000 Mask = 7C00FFFF
Error_Flag_11[31:0] = 00000000 Mask = 7FF0FFFF
SOLUTION SUMMARY:
Explanation:
The problem lies on the data path between the system board and the expander. The AXQ chip on the expander board (EXB) reports a command reissue timeout, as shown by the line "1E Timeout on command reissue transaction to Slot0". The message implicates Slot0 (the system board), but this error has historically been a "victim" error. In other words, the fault most likely lies with the source of the transaction, not with the destination that the message names. In this case, the data transaction is going from the EXB to the system board (SB), and the EXB exceeds the data transaction timeout. The SDI (System Data Interface) detects the timeout and triggers a Master Stop on the domain, which results in the dstop condition.
Action:
In this case, as in most similar cases, the error occurs under heavy I/O load, such as during a benchmark test or SunVTS testing. AXQ chips at revision 6.0 and below are susceptible to this failure under heavy I/O load. Lighter I/O loads can still trigger the error, but much less often than heavy loads do.
Several escalations have been opened against this issue. Where replacement parts are available, current practice is to replace all of the EXBs on the platform (not only the one that reported the error) with newer EXBs that have the AXQ 6.1 chip installed.
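To triage a similar dstop, the two facts above can be checked together: the wfail output contains the reissue timeout signature, and the shaxq header reports an AXQ revision of 6.0 or below. The following Python sketch is a minimal illustration of that check. The regular expression is modeled only on the single shaxq header line shown in this article ("AXQ EX7 (7) Component ID = C4312049 Rev 6.0"), so other redxl versions may need a different pattern. The function names are hypothetical. Because shaxq reports one expander at a time, the check would need to be run against the shaxq output of each expander.

```python
import re

# Modeled on the sample shaxq header in this article (assumption):
#   "AXQ EX7 (7) Component ID = C4312049 Rev 6.0"
AXQ_HEADER = re.compile(
    r"AXQ EX(\d+)\s*\(\s*\d+\)\s+Component ID = [0-9A-Fa-f]+\s+Rev (\d+)\.(\d+)"
)
SIGNATURE = "Timeout on command reissue transaction to Slot0"


def axq_revision(shaxq_text: str):
    """Return (expander, (major, minor)) from shaxq output, or None if no header matches."""
    m = AXQ_HEADER.search(shaxq_text)
    if m is None:
        return None
    return int(m.group(1)), (int(m.group(2)), int(m.group(3)))


def is_susceptible(rev: tuple[int, int]) -> bool:
    """Per this article, AXQ 6.0 and below are susceptible; 6.1 is the replacement."""
    return rev <= (6, 0)


def matches_known_issue(wfail_text: str, shaxq_text: str) -> bool:
    """True if wfail shows the reissue timeout and the AXQ revision is 6.0 or below."""
    parsed = axq_revision(shaxq_text)
    return SIGNATURE in wfail_text and parsed is not None and is_susceptible(parsed[1])


if __name__ == "__main__":
    wfail = "Err0[18]: R 1E Timeout on command reissue transaction to Slot0\n"
    shaxq = "AXQ EX7 (7) Component ID = C4312049 Rev 6.0\n"
    print(axq_revision(shaxq))                # (7, (6, 0))
    print(matches_known_issue(wfail, shaxq))  # True
```

Revisions are compared as integer tuples rather than as floats or strings, so that a later revision such as 6.10 would not compare as equal to 6.1.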
See Bug ID
FCO
Keywords: timeout, dstop, command, reissue, transaction, slot0
INTERNAL SUMMARY:
SUBMITTER: Joshua Freeman
BUG REPORT ID: 4505200, 4508788
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS: