| SRDB ID | Synopsis | Date |
| 48458 | Sun Fire[TM] 15K: Domain heartbeat failed | 31 Oct 2002 |
| Status | Issued |
| Description |
- Problem Statement:
Domain heartbeat failed and DSMD issues an XIR to recover
- Symptoms:
With this type of domain heartbeat failure, the dsmd "Forcing OS to panic" routine fails,
the dsmd recovery action produces little useful information, and the recovery action
only runs level 7 hpost diagnostics.
The following messages relating to the failure appear in the domain messages file:
Apr 29 11:56:17 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain heartbeat failed in state (K:20307: 1: 0).
Apr 29 11:56:17 2002 f15k1sc1-hme0 dsmd[17690]-K(): Forcing OS to panic
Apr 29 11:56:27 2002 f15k1sc1-hme0 dsmd[17690]-K(): Put Mailbox Message failed 1141
Apr 29 11:56:37 2002 f15k1sc1-hme0 dsmd[17690]-K(): Force OS panic timed out.
Apr 29 11:56:37 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain K OS is hung, aborting and rebooting domain.
Apr 29 11:56:49 2002 f15k1sc1-hme0 dsmd[17690]-K(): Sending XIR to every CPU in domain, rc = 0
Apr 29 11:56:50 2002 f15k1sc1-hme0 dsmd[17690]-K(): Taking CPU registers and IOSRAM domain data dump.
Apr 29 11:56:50 2002 f15k1sc1-hme0 dsmd[17690]-K(): XIR dump: /var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.dump.020429.1156.50
Apr 29 11:56:51 2002 f15k1sc1-hme0 dsmd[17690]-K(): Taking hardware configuration dump. Dump file:
-D/var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.hwconfig.020429.1156.50
Apr 29 11:59:47 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain K running OS - Solaris hostname is leo.
As the messages show, the mailbox communication that forces the bootproc to panic times out.
Once dsmd determines that the domain is hung, it sends an XIR to the domain processors
and creates a dsmd.dump file and a dsmd.hwconfig file.
Running showxirstate -f dsmd.dump.020429.1156.50 produces a CPU register dump of all the
processors in the domain, but no immediately useful information is available from it.
Examining the dsmd.hwconfig.020429.1156.50 file with redx reports "0 errors occurred
while creating this dump." and "No components would be failed based on this state."
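The dump file names passed to showxirstate and redx come from the dsmd messages shown above. As a minimal sketch, assuming the messages keep the path layout seen in this sample, the paths could be pulled out of a saved copy of the messages file as follows. The function name and the regular expression are illustrative and are not part of SMS.

```python
import re

# Matches dump paths as they appear in the sample messages, for example
# /var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.dump.020429.1156.50
DUMP_PATH = re.compile(r"(/\S*/dsmd\.(dump|hwconfig)\.\d{6}\.\d{4}\.\d{2})")


def find_dsmd_dumps(log_text):
    """Return {'dump': [...], 'hwconfig': [...]} with paths in log order."""
    found = {"dump": [], "hwconfig": []}
    for match in DUMP_PATH.finditer(log_text):
        path, kind = match.group(1), match.group(2)
        found[kind].append(path)
    return found


# Three lines copied from the messages quoted in this article.
SAMPLE = """\
Apr 29 11:56:50 2002 f15k1sc1-hme0 dsmd[17690]-K(): XIR dump: /var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.dump.020429.1156.50
Apr 29 11:56:51 2002 f15k1sc1-hme0 dsmd[17690]-K(): Taking hardware configuration dump. Dump file:
-D/var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.hwconfig.020429.1156.50
"""

dumps = find_dsmd_dumps(SAMPLE)
print(dumps["dump"])
print(dumps["hwconfig"])
```

The pattern keys on the dsmd.dump and dsmd.hwconfig file name prefixes rather than on the message wording, so it also picks up the hwconfig path that wraps onto its own line.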
SOLUTION SUMMARY:
- Troubleshooting:
This procedure works best when the failure recurs fairly consistently. Its purpose is to
gather additional data (a corefile and CPU signature states) that is relevant to the
failure.
A. First, disable the dsmd recovery action for the domain:
- Copy $SMSETC/config/dsmd_tuning.txt to $SMSETC/config/[A-R]/dsmd_tuning.txt,
where [A-R] is the domain letter.
The owner and permissions must be maintained in order for the dsmd daemon
to initialize the changed parameters. The permissions of the
$SMSETC/config/[A-R] directory must be:
drwxrwx---+ 2 root bin 512 Mar 27 17:09 K
Also, the owner and permissions of the dsmd_tuning.txt file must be:
-rw-r--r-- 1 root bin 1326 Dec 1 01:22 dsmd_tuning.txt
- Make the following changes in the domain dsmd_tuning.txt file:
obp_heartbeat_time = 1200
os_heartbeat_time = 1200
domain_asr = 0
At this point, you must stop and restart SMS on the MAINSC (as root user):
/etc/init.d/sms stop
/etc/init.d/sms start
NOTE: When you disable ASR (domain_asr = 0), a reset from OBP, whether manual
or automatic, will not come back to OBP. A Solaris reboot will not work either.
You must setkeyswitch off/on.
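The three parameter changes in step A can be sketched as a small text transformation. This is only an illustration: it assumes the tuning file holds one "name = value" setting per line, which the article does not confirm beyond the three lines it shows, and the sample file contents below are hypothetical. It does not touch ownership or permissions, which still have to match the listings in step A.

```python
import re

# Values taken from step A of this article.
OVERRIDES = {
    "obp_heartbeat_time": "1200",
    "os_heartbeat_time": "1200",
    "domain_asr": "0",
}


def apply_overrides(text, overrides):
    """Return text with each 'name = value' line updated from overrides.

    Assumes one 'name = value' setting per line. Other lines pass through
    unchanged, and settings not already present are appended at the end.
    """
    remaining = dict(overrides)
    out = []
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z_]\w*)\s*=", line)
        if match and match.group(1) in remaining:
            name = match.group(1)
            out.append(f"{name} = {remaining.pop(name)}")
        else:
            out.append(line)
    for name, value in remaining.items():
        out.append(f"{name} = {value}")
    return "\n".join(out) + "\n"


# Hypothetical file contents, for illustration only.
SAMPLE_TUNING = "obp_heartbeat_time = 60\nos_heartbeat_time = 60\n"
print(apply_overrides(SAMPLE_TUNING, OVERRIDES), end="")
```

Keeping the original values in a saved copy makes it straightforward to remove the settings again once the data has been gathered.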
B. Download showcpusig from http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/showcpusig.html.
Install it on the MAINSC and give it execute permissions.
C. Make sure dumpadm is set up to capture a core file and that enough disk space
is available.
D. On the next occurrence, the domain should noticeably hang and remain in that state.
At that point, have the SSE or customer run the showcpusig script on the MAINSC.
The showcpusig output should help determine which CPU is not responding.
sms-svc> ./showcpusig -d K
E. Interrupt the domain console in order to force a panic and a corefile
from OBP. To do this, enter the key sequence '~#' in the console window and
type sync at the ok prompt.
- Resolution:
This is a temporary workaround that changes the behaviour of the dsmd daemon in order to gather
additional data and help resolve the problem described in the problem statement. Once this
data is gathered and the problem is understood, you must remove the dsmd_tuning.txt
settings and restart SMS on the MAINSC.
The resolution to the domain heartbeat failed problem will require analysis of the showcpusig
output and the vmcore file. That is beyond the scope of this article. However, see the references
section below for the meaning of known cpu signature states.
- References and bug IDs
- Summary of part number and patch IDs
Bug 4658538 reboot "fails" if ASR=0 for domain/platform
Additional information regarding the showcpusig program can be found at the URL:
http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/showcpusig.html. The output shows the
signature state of the domain (as reported by the showplatform command), the
heartbeat of the domain, and the individual CPU signature states. The signature state
is decoded as follows:
4f530100 Solaris / Run / Null
The first 4 digits, starting from the left, are decoded in the following table:
4f42 = OBP
4f53 = Solaris
4442 = Debug
The next 2 digits are decoded in the following table:
00 = Non
01 = Run
02 = Exit
03 = Prerun
04 = Domain Stop
05 = Reset
06 = Power Off
07 = Detached
08 = Callback
09 = Offline
10 = Booting
11 = Unknown
12 = Error Reset
13 = Error Reset Sync
14 = Quiesced
15 = Quiesce In Progress
16 = Resume In Progress
17 = Init
18 = Loading
The next 2 digits are decoded in the following table:
00 = Null
01 = Halt
02 = Environment
03 = Reboot
04 = Panic
05 = Panic Con
06 = Hung
07 = Watch
08 = Panic Reboot
09 = Error Reset Reboot
10 = OBP Reset
11 = Debug
12 = Dump
13 = Failed
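The decoding described by the three tables above can be sketched as a simple lookup. This is not the showcpusig script itself. The labels are copied from the tables in this article, and the two-digit fields are matched as literal strings exactly as the tables list them, which is an assumption because the article does not say whether those codes are hexadecimal or decimal.

```python
# First 4 digits: the ASCII codes of "OB", "OS" and "DB".
OS_IDS = {"4f42": "OBP", "4f53": "Solaris", "4442": "Debug"}

# Next 2 digits: state, labels as listed in this article.
STATES = {
    "00": "Non", "01": "Run", "02": "Exit", "03": "Prerun",
    "04": "Domain Stop", "05": "Reset", "06": "Power Off",
    "07": "Detached", "08": "Callback", "09": "Offline",
    "10": "Booting", "11": "Unknown", "12": "Error Reset",
    "13": "Error Reset Sync", "14": "Quiesced",
    "15": "Quiesce In Progress", "16": "Resume In Progress",
    "17": "Init", "18": "Loading",
}

# Last 2 digits: substate, labels as listed in this article.
SUBSTATES = {
    "00": "Null", "01": "Halt", "02": "Environment", "03": "Reboot",
    "04": "Panic", "05": "Panic Con", "06": "Hung", "07": "Watch",
    "08": "Panic Reboot", "09": "Error Reset Reboot", "10": "OBP Reset",
    "11": "Debug", "12": "Dump", "13": "Failed",
}


def decode_signature(signature):
    """Decode an 8-digit signature such as '4f530100' into three labels."""
    sig = signature.strip().lower()
    if sig.startswith("0x"):
        sig = sig[2:]
    if len(sig) != 8:
        raise ValueError(f"expected 8 digits, got {signature!r}")
    os_id, state, substate = sig[:4], sig[4:6], sig[6:8]
    return " / ".join([
        OS_IDS.get(os_id, f"unknown({os_id})"),
        STATES.get(state, f"unknown({state})"),
        SUBSTATES.get(substate, f"unknown({substate})"),
    ])


print(decode_signature("4f530100"))  # Solaris / Run / Null
```

Codes that are not in the tables are reported as unknown rather than guessed, so an unexpected signature still shows its raw digits.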
- Additional background information:
Other information in the dsmd.dump file may be useful in determining the type of hang
experienced and the appropriate actions to take. For example, check the context
of escalation 536982. In that case, the dsmd.dump file led to the hypothesis that the domain
had hung on a clock thread, resulting in the heartbeat failure. By enabling the deadman kernel
in addition to the actions listed above, a corefile was obtained and the problem was root
caused to a missing SRM patch. See the escalation for details on enabling the deadman kernel.
sms-svc> showxirstate -f dsmd.dump.020429.1156.50 |more
ver : 003E0015.21000507 US3+_2.1 EPIC6cu
tba : 00000000.10000000
pil : 0xB <<-----------------!!! PROCESSOR INTERRUPT LEVEL
y : 00000000.00000000
afsr : 00000000.00000000 afar : 00000402.CA001F00
afsr2 : 00000000.00000000 afar2 : 00000402.CA001F00
pcontext: 00000000.00000000 scontext: 00000000.00000000
dcu : 00000200.00000000
dcr : 00000000.0000103F
pcr : 00000000.00000000
gsr : 00000000.00000000
softint : 0x0400 <<-----------------!!! INTERRUPTS PENDING REGISTER
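The two flagged registers can be read together. As a minimal sketch, assuming the SPARC V9 convention that bits 1 through 15 of SOFTINT correspond to interrupt levels 1 through 15 and that a level is only taken when it is greater than PIL, the following lists which pending levels are held off by the current PIL. The helper names are illustrative, and the parser reads only the first matching line, whereas a real showxirstate dump contains one block per CPU.

```python
import re

# The two flagged lines from the showxirstate output above.
SAMPLE = """\
pil : 0xB <<-----------------!!! PROCESSOR INTERRUPT LEVEL
softint : 0x0400 <<-----------------!!! INTERRUPTS PENDING REGISTER
"""


def read_register(dump_text, name):
    """Return the integer value of the first 'name : 0x...' line, or None."""
    pattern = rf"^\s*{re.escape(name)}\s*:\s*0x([0-9A-Fa-f]+)"
    match = re.search(pattern, dump_text, re.MULTILINE)
    return int(match.group(1), 16) if match else None


def pending_levels(softint):
    # Assumption: bits 1..15 of SOFTINT map to interrupt levels 1..15.
    return [level for level in range(1, 16) if softint & (1 << level)]


def masked_levels(softint, pil):
    # Assumption: a level is only taken when it is strictly greater than PIL.
    return [level for level in pending_levels(softint) if level <= pil]


pil = read_register(SAMPLE, "pil")
softint = read_register(SAMPLE, "softint")
print(pil, pending_levels(softint), masked_levels(softint, pil))
```

For the values shown, a level 10 interrupt is pending while the processor is at PIL 11, so that interrupt cannot be taken. Whether level 10 is the clock interrupt on this system is something to confirm against the escalation rather than assume from this sketch.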
- Meta-Data/Problem categorization:
Product/Platform: SF15K
Category:
- Keywords
15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K,
hang, heartbeat, XIR, dsmd, deadman, showxirstate
INTERNAL SUMMARY:
SUBMITTER: Gino Valencia
BUG REPORT ID: 4658538
APPLIES TO: Hardware/Sun Fire/15000
ATTACHMENTS: