A server can reboot abnormally for many reasons; you need to analyze the log files to diagnose the root cause.
Issue
The server test1 (10.10.10.1) was abnormally rebooted on 26-02-2011.
Fact
Node 1 was evicted by node 2 because connections could not be made from node 2 to node 1. (In the cssd.log excerpt below, test1 is CSS node 0 and the surviving node, poiddb02, is CSS node 1.)
Node 2 cssd.log
[ CSSD]2011-02-26 03:04:01.745 [10] >TRACE: clssgmClientConnectMsg: Connect from con(6000000000048870) proc(6000000000103df0) pid() proto(10:2:1:1)
[ CSSD]2011-02-26 03:04:32.370 [5] >TRACE: clssnm_skgxncheck: CSS daemon failed on node 0
[ CSSD]2011-02-26 03:04:32.371 [5] >TRACE: clssnmDiscHelper: node test1 (0) connection failed ========>>>>>>>>>>>> cannot connect to node test1
[ CSSD]2011-02-26 03:04:32.707 [16] >TRACE: clssnmDoSyncUpdate: Initiating sync 13
[ CSSD]2011-02-26 03:04:32.707 [16] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (597000)ms
[ CSSD]2011-02-26 03:04:32.707 [16] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2011-02-26 03:04:32.707 [16] >TRACE: clssnmSetupAckWait: node(0) is ALIVE
[ CSSD]2011-02-26 03:04:32.708 [16] >TRACE: clssnmSetupAckWait: node(1) is ALIVE
[ CSSD]2011-02-26 03:04:32.708 [16] >TRACE: clssnmSendSync: syncSeqNo(13)
[ CSSD]2011-02-26 03:04:32.708 [16] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(2)
[ CSSD]2011-02-26 03:04:32.708 [9] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[poiddb02] seq[1] sync[13]
[ CSSD]2011-02-26 03:04:32.708 [16] >TRACE: clssnmWaitForAcks: node(0) is expiring, msg type(11)
[ CSSD]2011-02-26 03:04:32.708 [9] >TRACE: clssnmHandleSync: diskTimeout set to (597000)ms
[ CSSD]2011-02-26 03:04:32.709 [16] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2011-02-26 03:04:32.709 [16] >TRACE: clssnmDoSyncUpdate: Terminating node 0, test1, misstime(21531) state(3)
[ CSSD]2011-02-26 03:04:32.709 [16] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2011-02-26 03:04:32.709 [1] >USER: NMEVENT_SUSPEND [00][00][00][03]
[ CSSD]2011-02-26 03:04:32.709 [16] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE
[ CSSD]2011-02-26 03:04:32.709 [16] >TRACE: clssnmSendVote: syncSeqNo(13)
[ CSSD]2011-02-26 03:04:32.710 [16] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2011-02-26 03:04:32.710 [9] >TRACE: clssnmSendVoteInfo: node(1) syncSeqNo(13)
[ CSSD]2011-02-26 03:04:32.711 [16] >TRACE: clssnmWaitForAcks: done, msg type(13)
[ CSSD]2011-02-26 03:04:32.711 [16] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-02-26 03:04:32.712 [16] >TRACE: clssnmEvict: Start =====================>>>>>>>>>>>>>>>>>>>> Node 2 evicts node 1
[ CSSD]2011-02-26 03:04:32.712 [16] >TRACE: clssnmWaitOnEvictions: Start
[ CSSD]2011-02-26 03:04:32.712 [16] >TRACE: clssnmWaitOnEvictions: Node(0) down, LATS(2088370148),timeout(21643)
[ CSSD]2011-02-26 03:04:32.712 [16] >TRACE: clssnmSetupAckWait: Ack message type (15)
[ CSSD]2011-02-26 03:04:32.712 [16] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE
[ CSSD]2011-02-26 03:04:32.712 [16] >TRACE: clssnmSendUpdate: syncSeqNo(13)
[ CSSD]2011-02-26 03:04:32.713 [16] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(1)
[ CSSD]2011-02-26 03:04:32.713 [9] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (1293953635/1293953635) prevConuni(1293953635) birth (10/0) (old/new)
[ CSSD]2011-02-26 03:04:32.713 [9] >TRACE: clssnmDeactivateNode: node 0 (test1) left cluster
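The eviction sequence above can be pulled out of a large cssd.log quickly by scanning for the key CSS messages. A minimal sketch, assuming the log path and the marker list below (adjust both for your Clusterware release):

```python
#!/usr/bin/env python3
"""Scan cssd.log for the CSS messages that bracket a node eviction."""
import sys

# Marker list is an assumption based on the excerpt above; adjust as needed.
MARKERS = (
    "clssnm_skgxncheck",     # CSS daemon failure detected on a peer
    "clssnmDiscHelper",      # connection to a peer node failed
    "clssnmDoSyncUpdate",    # cluster reconfiguration initiated
    "clssnmEvict",           # eviction of the errant node started
    "clssnmDeactivateNode",  # node removed from the cluster
)

def scan(path):
    with open(path, errors="replace") as log:
        for line in log:
            if any(marker in line for marker in MARKERS):
                print(line.rstrip())

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "cssd.log")
```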
Findings
When two systems have access to shared storage, the integrity of the data depends on the systems communicating through heartbeats over the private interconnects. When the private links are lost, each system believes the other has left the cluster; each then tries to become the master, or to form a sub-cluster, and to claim exclusive access to the shared storage. This condition is known as split-brain, and two nodes writing to the same storage independently will corrupt the data.
To avoid this tricky and undesirable situation, the basic approach is STOMITH (Shoot The Other Machine In The Head) fencing: the errant cluster node is simply reset and forced to reboot, as the sketch below illustrates.
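A toy illustration (not Oracle's implementation) of why fencing matters: without it, both nodes survive a heartbeat loss and claim the storage; with it, only the node that wins the vote does.

```python
"""Toy model of split-brain vs. STOMITH fencing (illustrative only)."""

def survivors(nodes, fencing):
    # The heartbeat is lost: every node now believes its peer is dead.
    if not fencing:
        # Each node independently claims the storage -> split-brain.
        return set(nodes)
    # With fencing, the vote winner resets the peer before touching storage.
    winner = min(nodes)  # stand-in for the real voting-disk protocol
    return {winner}

print(survivors({"node1", "node2"}, fencing=False))  # both survive: corruption risk
print(survivors({"node1", "node2"}, fencing=True))   # one survivor: data is safe
```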
Solution
Please check with the system administrator for network errors on the private interconnect around the incident time; a scan such as the one sketched below can help narrow the window.
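A minimal sketch for that check, assuming a classic syslog format at /var/log/syslog and the keyword list below (both are assumptions to adapt to the actual platform):

```python
#!/usr/bin/env python3
"""Find network-related syslog lines near the incident time (a sketch)."""
from datetime import datetime, timedelta

INCIDENT = datetime(2011, 2, 26, 3, 4, 32)  # taken from cssd.log above
WINDOW = timedelta(minutes=10)
# Keyword list is an assumption; extend it for the NIC/driver in use.
KEYWORDS = ("link down", "interface", "nic", "timeout", "unreachable")

def check(path="/var/log/syslog"):
    with open(path, errors="replace") as log:
        for line in log:
            try:
                # Classic syslog prefix: "Feb 26 03:04:31 host ..."
                stamp = datetime.strptime(line[:15], "%b %d %H:%M:%S")
                stamp = stamp.replace(year=INCIDENT.year)
            except ValueError:
                continue
            if abs(stamp - INCIDENT) <= WINDOW and any(
                kw in line.lower() for kw in KEYWORDS
            ):
                print(line.rstrip())

if __name__ == "__main__":
    check()
```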