Tuesday, July 5, 2016

Why do we need the Voting disks in RAC -


Have you ever think about How clusterware detects network failures and how does it avoid split brain situation ?

If you really curious about it ,you have to understand the structure of voting disk.

In RAC, CSSD processes (Cluster Services Synchronization Daemon) monitor the health of RAC nodes employing two distinct heart beats: Network heart beat(NHB) and Disk heart beat(DHB). Healthy nodes will have continuous network and disk heartbeats exchanged between the nodes. Break in heart beat indicates a possible error scenario.There are few different scenarios possible with missing heart beats:

1. Network heart beat is successful, but disk heart beat is missed. – Problem with voting disk

2. Disk heart beat is successful, but network heart beat is missed. – Problem with private interconnect.

3. Both heart beats failed.

what is the function of Voting Disk?


Voting disk Manages cluster membership by way of a health check and arbitrates cluster ownership among the instances in case of network failures( interconnect). This is facilitated by recording DHB from each node within the cluster - Disk heartbeat is just a binary bit (or OS block) stating the connection status for all network heartbeat for all nodes.

CSSD process on every node makes entries in the voting disk to ascertain the membership of that node. By this way voting disk  records node membership information. Under normal operation RAC keeps track of node status using network heartbeat (using interconnect) and disk heartbeat.

Consider these two images as a demonstration of the dynamic data in the voting disks for two different scenarios-




NHB:- CSSD sends an NHB every seconds  from one node(for eg RAC1 ) to all other nodes in the cluster and receive a NHB  from every remote nodes (RAC2 & RAC3)and it will recorded in Voting as DHB.

ie, Apart from NHB oracle use DHB , which is required for split brain resolution , based on the NHB CSSD sends DHB  which states the network hearbeat for all nodes, like whether the Network heart beat is ok or not.

If we don’t have a voting disk and our private interconnect fails we might end up with a split brain scenario. Suppose we have a three node cluster, and node1 has lost the network connection to the Interconnect. In order to prevent that, redundant network cards are recommended since a long time. We introduced HAIP in 11.2.0.2 to make that easier to implement, without the need of bonding, by the way. But here, node1 cannot use the Interconnect anymore. It can still access the Voting Disk, though. Nodes 2 and 3 see their heartbeats still but no longer node1.The node with the network problem gets evicted by placing the Poison Pill into the Voting File for node1. cssd of node1 will commit suicide now and leave the cluster.

How it works internally?

If the disk block(within the voting disk) is not updated in a short time-out period, that node is considered unhealthy and  may be rebooted to protect the database information. In this case , a message to this effect is written in the kill block of the node. Each node  reads its kill block once per second, if the kill block is overwritten node commits suicide.During reconfig (join or leave) CSSD monitors all nodes and determines whether  a node has a disk heartbeat, including those with no network heartbeat. If no disk  heartbeat is detected  then node is declared as dead.

The first image shows a fully working RAC with all nodes up and running. But on the other image we can see that vote disks contains FAIL(fail is actually a block that has NOT been written the NHB) and now has the kill block (poison pill) filled up. When RAC1 loses the network heartbeat to the other nodes, it gets evicted by by the other nodes by filling the Kill block on the Voting disk for RAC1. Forcing the node to panic and remove itself from the cluster.

What is CSS miscount ?

The CSS miscount represents the maximum seconds the network hearbeat can be missed before entering into cluster reconfiguration and evict the node. The default CSS miscount is 30 seconds.
A node must be able to access more than half of the voting disks at any time. For example, let’s have a two node cluster with an even number of let’s say 2 voting disks. Let Node1 is able to access voting disk1 and Node2 is able to access voting disk2 . This means that there is no common file where clusterware can check the heartbeat of both the nodes.  Hence, if we have 2 voting disks, all the nodes in the cluster should be able to access both the voting disks. If we have 3 voting disks and both the nodes are able to access more than half i.e. 2 voting disks, there will be at least on disk which will be accessible by both the nodes. The clusterware can use that disk to check the heartbeat of both the nodes. Hence, each  node should be  able to access more than half the number of voting disks. A node not able  to do so will have to be evicted from the cluster to maintain the integrity of the cluster.Loss of more than half your voting disks will cause the entire cluster to fail.

Why should we have an odd number of voting disks?

Here is a table which represents the number of voting disks whose failure can be tolerated for different numbers of voting disks:

It can be seen that number of voting disks whose failure can be tolerated is same for (2n-1) as well as 2n voting disks where n can be 1, 2 or 3. Hence to save a redundant voting disk, (2n-1) i.e. an odd number of voting disks are desirable.

Example 1:-


Suppose in a 3 node cluster with 3 voting disks, a network heartbeat fails between Node 1 and Node 3 & Node 2 and Node 3(ie node 3 loses the network heart beat to other ) whereas Node 1 and Node 2 are able to communicate via interconnect, and from the Voting Disk CSSD notices that all the nodes are able to write to Voting Disks thus spli-brain, so the healthy nodes Node 1 & Node 2 would would update the kill block in the voting disk for Node 3. Then when during pread() system call of CSSD of Node 3, it sees a self kill flag set and thus the CSSD of Node 3 evicts itself.

Example 2:- 

A node must be able to strictly access more than half of the voting disks at any time. Suppose in a 2 node cluster with 3 voting disk, a disk heartbeat fails such that Node 1 can see 2 Voting Disks and Node 2 can see 1 Voting Disk, ( If here the Voting Disk wouldn’t have been odd then both the Nodes would have thought the other node should be killed hence would have been difficult to avoid split-brain), thus based on Simple Majority Rule, CSSD process of Node 1 (2 Voting Disks) sends a kill request to the CSSD process of Node 2 (1 Voting Disk) and thus the Node 2 evicts.

You will see error like following in ocssd.log

Insufficient voting files found, found 1 of 3 configured, needed 2 voting files

Found 1 voting files, but 2 are required. Terminating due to insufficient configured voting files

Reference:- https://www.protractus.com/2014/12/rac-voting-disks-what-the-fuzz-is-all-about/

No comments:

Post a Comment