Manual recovery of DRBD two-primary mode failures


You can build an active-active Linux cluster using only two servers with DRBD. This setup requires DRBD configured with allow-two-primaries, the Logical Volume Manager with clustered locking, and a clustered filesystem such as GFS2.

But things can go badly wrong, especially if you are not using a hardware fencing device: the two nodes can lose sight of each other and their data can diverge.

For example, suppose you have two identical servers named A and B. Node A loses its network connection for a considerable time, you reboot both A and B, and both nodes come up in Primary/Unknown state. While DRBD runs in this state, each node is unaware of the changes made on the other. You have to resolve this and get the DRBD setup syncing again.
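
To check whether a node is in the state described above, you can pull the ro: field out of /proc/drbd. The following is a minimal sketch, assuming the DRBD 8.x style status line (cs:... ro:... ds:...) quoted later in this post. The function name drbd_field is made up for illustration, and with several resources it only reports the first one.

```shell
# Print the value of one field (cs, ro or ds) from /proc/drbd style input.
# The field name is $1; the status text is read from stdin.
# Assumption: fields look like " ro:Primary/Unknown", as in DRBD 8.x.
drbd_field() {
    sed -n "s/.*[[:space:]]$1:\([^[:space:]]*\).*/\1/p" | head -n 1
}
```

On a node in the situation above, running drbd_field ro < /proc/drbd should print Primary/Unknown.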

First of all, you have to decide which node has the most recent modifications: node A or node B?

If you know that node B has the most recent changes and you want to fully resynchronize node A from node B, follow the steps below. Any changes that exist only on node A will be lost.

To start working on node A, we have to disconnect it from the DRBD resource (this post assumes the resource is named data_res):

Important! Run the following commands on node A.

$ sudo drbdadm disconnect data_res

If you get an error like Invalid configuration request, the node is already disconnected from the resource, so you can ignore the error.

After the disconnect, node A is still in Primary/Unknown state, and you have to change it to Secondary. To do this, run the following command:

$ sudo drbdadm secondary data_res

If you’re using a clustered filesystem on top of LVM, you will get an error such as device already opened or State change failed: (-12) Device is held open by someone. You can solve this by unmounting the clustered filesystem and deactivating the volume group that lives on the DRBD device. You can list volume group names with the vgdisplay command. If your volume group is named vg-data, deactivate it with:

$ sudo vgchange -an vg-data

After that, run the drbdadm secondary command above again to make node A Secondary.

As the last step on node A, we connect to the DRBD resource again, discarding node A's local changes:

$ sudo drbdadm connect --discard-my-data data_res
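
The node A steps above can be collected into one ordered list. The sketch below only prints the commands and never runs them, so you can review the list before typing anything as root. The function name node_a_plan and the mount point argument are hypothetical; use your own resource, volume group and mount point.

```shell
# Print, in order, the commands this post runs on node A (the node whose
# data is discarded). Nothing is executed here.
# $1 = DRBD resource, $2 = volume group, $3 = mount point (hypothetical).
node_a_plan() {
    res=$1 vg=$2 mnt=$3
    printf '%s\n' \
        "drbdadm disconnect $res" \
        "umount $mnt" \
        "vgchange -an $vg" \
        "drbdadm secondary $res" \
        "drbdadm connect --discard-my-data $res"
}
```

For example, node_a_plan data_res vg-data /mnt/data prints the five commands for the setup used in this post. Run them one at a time with sudo, since the disconnect step may report an error that is safe to ignore, as noted above.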

Important! Run the following commands on node B.

Sometimes node B is in StandAlone state at this point. In that case, to start the DRBD synchronization you have to run the connect command on node B as well (this is only required when node B is StandAlone):

$ sudo drbdadm connect data_res
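
Since the connect on node B is only needed in that one state, you can test for it first. This is a minimal sketch, assuming /proc/drbd reports the connection state as cs:StandAlone (DRBD 8.x format). The helper name is_standalone is made up for illustration.

```shell
# Succeed (exit status 0) if the status text on stdin shows a resource
# whose connection state is StandAlone.
is_standalone() {
    grep -q 'cs:StandAlone'
}
```

It could be used on node B like this: if is_standalone < /proc/drbd; then sudo drbdadm connect data_res; fi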

That’s all; synchronization will start automatically. You can monitor its progress through /proc/drbd. Once synchronization has completed, node A and node B will again show cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate.

If node A still shows as Secondary, you have to change its state to Primary (run the following command on node A):
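
Instead of re-reading /proc/drbd by hand, you can wait for the finished state in a loop. The sketch below assumes the DRBD 8.x status line format shown above and a single resource. The helper name drbd_in_sync is made up for illustration.

```shell
# Succeed (exit status 0) once the status text on stdin shows a connected
# resource whose disks are UpToDate on both nodes.
drbd_in_sync() {
    grep -q 'cs:Connected .*ds:UpToDate/UpToDate'
}
```

A possible wait loop: until drbd_in_sync < /proc/drbd; do sleep 5; done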

$ sudo drbdadm primary data_res

Now you can confirm that both node A and node B show as Primary/Primary:

$ sudo drbd-overview 
0:data_res/0  Connected Primary/Primary UpToDate/UpToDate C r----- lvm-pv