How to Diagnose Node Reboot Problems (Oracle Official Blog)

How do you analyze this kind of problem? Start with the system log. This case is on HP-UX, where the system log is /var/adm/syslog/syslog.log; on AIX, use errpt.

In the system log I saw:

Nov 11 18:43:57 rx8640c syslog: Oracle CSS family monitor shutting down. 3

Nov 11 18:43:59 rx8640c su: + tty root-oracle

Nov 11 18:43:59 rx8640c syslog: Cluster Ready Services completed waiting on dependencies.
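On a busy host the clusterware messages are easy to miss among other syslog entries. The following is a minimal Python sketch that filters them out, assuming the line layout seen in the excerpt above. The function name clusterware_events and the keyword list are illustrative choices, not part of any Oracle tool.

```python
import re

# Layout assumed from the excerpt: "Mon DD HH:MM:SS host tag: message".
SYSLOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<tag>[^:]+): (?P<msg>.*)$"
)

# Keywords taken from the messages quoted above (an assumption, not a full list).
KEYWORDS = ("CSS", "Cluster Ready Services")


def clusterware_events(lines):
    """Return (timestamp, host, message) for lines that mention clusterware."""
    events = []
    for line in lines:
        match = SYSLOG_LINE.match(line.strip())
        if match and any(word in match["msg"] for word in KEYWORDS):
            events.append((match["stamp"], match["host"], match["msg"]))
    return events


sample = [
    "Nov 11 18:43:57 rx8640c syslog: Oracle CSS family monitor shutting down. 3",
    "Nov 11 18:43:59 rx8640c su: + tty root-oracle",
    "Nov 11 18:43:59 rx8640c syslog: Cluster Ready Services completed waiting on dependencies.",
]

events = clusterware_events(sample)
for stamp, host, message in events:
    print(stamp, host, message)
```

The su entry is dropped because its message contains neither keyword; the two clusterware lines remain.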

Comparing this with the alert log shows that the system rebooted at about this time:

Wed Nov 11 18:43:28 2009

Trace dumping is performing id=[cdmp_20091111184328]

Wed Nov 11 18:57:17 2009

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0
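The gap between the last alert-log entry and the next "Starting ORACLE instance" line brackets the time of the reboot. Here is a rough Python sketch of that check, assuming the timestamp format shown in the excerpt and an English locale. The helper names parse_stamp and find_restart_gaps are made up for this illustration.

```python
from datetime import datetime

# Timestamp format used in the excerpt, e.g. "Wed Nov 11 18:43:28 2009".
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_stamp(line):
    """Return a datetime if the line is an alert-log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), STAMP_FORMAT)
    except ValueError:
        return None


def find_restart_gaps(lines):
    """Yield (last_activity, startup) timestamp pairs around each startup."""
    previous = None  # timestamp of the block before the current one
    current = None   # timestamp of the block being read
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            previous, current = current, stamp
        elif line.strip().startswith("Starting ORACLE instance"):
            if previous is not None and current is not None:
                yield previous, current


sample = [
    "Wed Nov 11 18:43:28 2009",
    "Trace dumping is performing id=[cdmp_20091111184328]",
    "Wed Nov 11 18:57:17 2009",
    "Starting ORACLE instance (normal)",
]

gaps = list(find_restart_gaps(sample))
for last_activity, startup in gaps:
    print(last_activity, "->", startup, "gap:", startup - last_activity)
```

For the excerpt above this reports a gap of 13 minutes 49 seconds between the last entry and the instance startup.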

On AIX you can check with last shutdown; I don't know whether the same command works on HP-UX.

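Here syslog.log shows the CSS process shutting down (that interpretation is my own guess). When CSS shuts down or fails, the host is rebooted automatically, which matches what happened.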

The next step is to analyze the ocssd log under ORA_CRS_HOME:

[ CSSD]2009-11-11 18:39:18.460 [13] >WARNING: clssgmAssignMemberNo(): grock(#CSS_CLSSOMON) memberNo(1) already assigned

[ CSSD]2009-11-11 18:39:34.313 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 50% heartbeat fatal, eviction in 14.807 seconds

[ CSSD]2009-11-11 18:39:35.313 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 50% heartbeat fatal, eviction in 13.807 seconds

[ CSSD]2009-11-11 18:39:42.313 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 75% heartbeat fatal, eviction in 6.807 seconds

[ CSSD]2009-11-11 18:39:45.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:45.314 [14] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)

[ CSSD]2009-11-11 18:39:46.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:46.314 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 90% heartbeat fatal, eviction in 2.807 seconds

[ CSSD]2009-11-11 18:39:47.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:47.314 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 90% heartbeat fatal, eviction in 1.807 seconds

[ CSSD]2009-11-11 18:39:48.313 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:48.314 [14] >WARNING: clssnmPollingThread: node rx8640c (1) at 90% heartbeat fatal, eviction in 0.807 seconds

[ CSSD]2009-11-11 18:39:49.133 [14] >TRACE: clssnmPollingThread: node rx8640c (1) is impending reconfig

[ CSSD]2009-11-11 18:39:49.134 [14] >TRACE: clssnmPollingThread: Eviction started for node rx8640c (1), flags 0x000f, state 3,

This log makes it clear: the private-network heartbeat was lost and the node was evicted.
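The heartbeat countdown in the ocssd log can also be pulled out mechanically, which helps when the log is long. Below is a small Python sketch that extracts the "heartbeat fatal" warnings and the eviction start. The regular expressions are written only against the lines quoted above, and the function name heartbeat_timeline is an assumption made for this example.

```python
import re

# Written against the clssnmPollingThread lines quoted above; matching stops
# at "se" so that a line wrapped in the middle of "seconds" still matches.
WARNING = re.compile(
    r"\[\s*CSSD\](?P<time>\S+ \S+) .*clssnmPollingThread: node (?P<node>\S+) "
    r"\((?P<num>\d+)\) at (?P<pct>\d+)% heartbeat fatal, "
    r"eviction in (?P<secs>[\d.]+) se"
)
EVICTION = re.compile(
    r"\[\s*CSSD\](?P<time>\S+ \S+) .*Eviction started for node (?P<node>\S+)"
)


def heartbeat_timeline(lines):
    """Return (warnings, eviction_time) parsed from ocssd log lines."""
    warnings = []         # (timestamp, percent of timeout used, seconds left)
    eviction_time = None  # timestamp of the last "Eviction started" line seen
    for line in lines:
        match = WARNING.search(line)
        if match:
            warnings.append(
                (match["time"], int(match["pct"]), float(match["secs"]))
            )
            continue
        match = EVICTION.search(line)
        if match:
            eviction_time = match["time"]
    return warnings, eviction_time


sample = [
    "[ CSSD]2009-11-11 18:39:34.313 [14] >WARNING: clssnmPollingThread: "
    "node rx8640c (1) at 50% heartbeat fatal, eviction in 14.807 seconds",
    "[ CSSD]2009-11-11 18:39:48.314 [14] >WARNING: clssnmPollingThread: "
    "node rx8640c (1) at 90% heartbeat fatal, eviction in 0.807 seconds",
    "[ CSSD]2009-11-11 18:39:49.134 [14] >TRACE: clssnmPollingThread: "
    "Eviction started for node rx8640c (1), flags 0x000f, state 3,",
]

warnings, eviction_time = heartbeat_timeline(sample)
for stamp, percent, seconds_left in warnings:
    print(stamp, f"{percent}%", f"{seconds_left}s left")
print("eviction started:", eviction_time)
```

A steadily shrinking countdown followed by an "Eviction started" line is the pattern to look for when a missed network heartbeat is suspected.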

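As for why the private network failed and the heartbeat was lost, that is not something a DBA can resolve; write up a report and hand it to the network team.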

As an aside, three processes can cause a node reboot: OCSSD, OPROCD, and OCLSOMON.

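In general, OCSSD reboots are caused by heartbeat loss (a problem with the network heartbeat or the voting disk), by the CSS process being unable to get CPU time, or by bugs. OPROCD and OCLSOMON reboots are caused by the process being unable to get CPU time or by bugs.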

In this case an ORA-600 error was also reported just before the node rebooted:

Wed Nov 11 18:43:27 2009

Errors in file /oracle/app/oracle/admin/ora10g/udump/ora10g1_ora_24884.trc:

ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []

This was confirmed to be Bug 5486074:

ORA-600 [keltnfy-ldminit] can occur in the Server Generated Alert subsystem when it cannot determine the Host Name or Network Address. This can be caused by DNS server being unavailable.

I looked into it: nothing says this error kills CSS or reboots the host, and the error appears to be raised on the client side...

At the very least, it confirms that the network did have a problem.

Errors were also reported during startup:

Wed Nov 11 18:58:06 2009

Errors in file /oracle/app/oracle/admin/ora10g/udump/ora10g1_ora_7203.trc:

ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []

Wed Nov 11 18:58:07 2009

Errors in file /oracle/app/oracle/admin/ora10g/udump/ora10g1_ora_7203.trc:

ORA-07445: exception encountered: core dump [kgscDump()+801] [SIGSEGV] [Address not mapped to object] [0x000001004] [] []

ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []

Wed Nov 11 18:58:08 2009

Errors in file /oracle/app/oracle/admin/ora10g/udump/ora10g1_ora_7203.trc:

ORA-07445: exception encountered: core dump [kgscDump()+801] [SIGSEGV] [Address not mapped to object] [0x000001004] [] []

ORA-07445: exception encountered: core dump [kgscDump()+801] [SIGSEGV] [Address not mapped to object] [0x000001004] [] []

ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []

ORA-07445 [kgscDump] matches Bug 5508574 - OERI[504] / OERI[99999] / Dump [kgscdump] with > 31 CPUs, but this system has only 15 CPUs (30 cores).

For ORA-00600 [ksprlspeeq3] I found no bug related to 10.2.0.3, so I am leaving it alone for now.
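When the alert log repeats the same errors many times, it helps to reduce them to distinct signatures before searching for bugs. This is a minimal Python sketch, assuming the first bracketed argument is the value used for the lookup, as it was for the errors discussed here. The function name error_signatures is hypothetical.

```python
import re
from collections import Counter

# Captures the ORA code and the first bracketed argument on the line.
ORA_LINE = re.compile(r"^(ORA-\d{5}): .*?\[([^\]]*)\]")


def error_signatures(lines):
    """Count distinct (ORA code, first argument) pairs in alert-log lines."""
    counts = Counter()
    for line in lines:
        match = ORA_LINE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []",
    "ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []",
    "ORA-07445: exception encountered: core dump [kgscDump()+801] [SIGSEGV] "
    "[Address not mapped to object] [0x000001004] [] []",
    "ORA-00600: internal error code, arguments: [ksprlspeeq3], [65536], [], [], [], [], [], []",
]

signatures = error_signatures(sample)
for (code, argument), count in signatures.most_common():
    print(code, argument, count)
```

Each distinct pair is then one search term, for example ORA-00600 with ksprlspeeq3.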

I recommend MetaLink note 4.1: it is the former Knowledge section, with many categorized articles and a list of tools.