Symptom
Failure to start the Slave instance is observed in scenarios such as the following:
a) Bringing the Slave node down for a cold backup and then starting it back up
b) Bringing the cluster down for maintenance operations and then starting the cluster nodes
Causes
1) Network issues at the infrastructure level - the Slave runs into a "Read from master timed out" situation
On Master:
com.day.crx.core.cluster.ClusterMaster I/O error while processing connect. java.net.SocketTimeoutException: Read timed out
On Slave:
com.day.crx.core.cluster.ClusterMaster I/O error while processing connect. java.net.SocketTimeoutException: Read timed out
Solution: Check your network infrastructure and make sure there are no firewall changes or outages, and verify that the Master and Slave nodes are able to communicate with each other.
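To rule out firewall or routing problems quickly, a simple TCP reachability test between the nodes can help. This is a minimal sketch only: the peer hostname is hypothetical and 8088 is an assumed default cluster port, so take the real host and port from your cluster configuration.

    # Minimal TCP reachability check between cluster nodes. The host and port
    # below are assumptions -- read the actual values from your cluster
    # configuration (8088 is only an assumed default here).
    import socket

    PEER_HOST = "master.example.com"  # hypothetical hostname of the peer node
    PEER_PORT = 8088                  # assumed cluster port; confirm in your config

    def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print(f"Cannot reach {host}:{port}: {exc}")
            return False

    print("reachable" if can_reach(PEER_HOST, PEER_PORT) else "NOT reachable")

Run the check in both directions, since firewall rules are often asymmetric.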
2) Incorrect sequence followed for starting and stopping the cluster nodes - causing an "Out of Sync" situation
* ClusterTarSet: Could not open (ClusterTarSet.java, line 820) java.io.IOException: This cluster node and the master are out of sync. Operation stopped. Please ensure the repository is configured correctly. To continue anyway, please delete the index and data tar files on this cluster node and restart. Please note the Lucene index may still be out of sync unless it is also deleted
Analysis and Possible Reason for Out of Sync Cluster Nodes
This situation arises from an improper shutdown and restart sequence for the cluster nodes. Suppose the server hosting the Master instance was brought down as part of regular maintenance, the Slave took over as the new Master, and it was allowed to run as a standalone instance. If the Slave was later shut down and the old Master instance was started first, the Slave refuses to join and the nodes go out of sync. This is expected, as the Slave has newer revisions than the old Master.
Solution:
a) Let the old Slave node continue running as the new Master.
b) Start the old Master and let it join as the current Slave for now.
c) Allow the old Master [current Slave] to connect to the old Slave [new Master] and sync itself with the latest revisions.
d) Once the sync completes, you can switch the roles of the Master and Slave nodes by simply stopping and restarting the current Master [old Slave] node.
e) While the current Master [old Slave] is stopped, the current Slave [old Master] regains the Master role. A scripted sketch of this sequence is shown below.
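If you script this switchover, the outline below may help. It is a sketch only, under stated assumptions: the hostnames, the ssh-based run_on() helper, and the start/stop script paths are all hypothetical, and the sleep is merely a placeholder for confirming in the logs that the sync has actually completed.

    # Hedged sketch of the role-switch steps a) through e) above. Everything
    # named here (hosts, paths, ssh transport) is illustrative -- adapt it to
    # how you actually start and stop your CQ instances.
    import subprocess
    import time

    OLD_MASTER = "node-a.example.com"  # stopped first during maintenance; now stale
    OLD_SLAVE = "node-b.example.com"   # ran last as standalone; has the newest revisions

    def run_on(host: str, command: str) -> None:
        """Run a command on a node over ssh (illustrative; adapt to your tooling)."""
        subprocess.run(["ssh", host, command], check=True)

    # a) The old Slave keeps running as the new Master (no action if already up).
    # b) Start the old Master; with clustered.txt present it joins as the current Slave.
    run_on(OLD_MASTER, "crx-quickstart/bin/start")  # hypothetical start script
    # c) Let the current Slave sync the latest revisions from the new Master.
    #    Verify completion in the instance logs; the sleep is only a placeholder.
    time.sleep(600)
    # d) + e) Stop and restart the current Master (old Slave); while it is down,
    # the current Slave (old Master) takes back the Master role.
    run_on(OLD_SLAVE, "crx-quickstart/bin/stop")   # hypothetical stop script
    run_on(OLD_SLAVE, "crx-quickstart/bin/start")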
3) Manual deletion of the marker file clustered.txt on a stopped cluster node - this causes the instance to start as a Master node when it was supposed to start as a Slave node and join the existing Master
Analysis and Possible Reason for Out of Sync Cluster Nodes
If your running Master instance suffers an abrupt or ungraceful shutdown, it cannot rejoin the cluster after being restarted. This can occur when a write operation was in progress at the moment the Master node was stopped, or when a write operation occurred a few seconds before the Master instance was stopped. In these cases, the Slave instance may not have received all changes from the Master instance. When the Master is then restarted, CRX detects that it is out of sync with the remaining cluster instances and the repository will not start.
You may have then tried to start this Master cluster node by deleting its clustered.txt file manually. Generally speaking, clustered.txt should never be deleted in order to start up an instance. It is a marker file indicating that the instance has to join as a Slave the next time it is started. The cluster node that stops last does not have a clustered.txt file; this identifies that node as the last running Master, which should be started first.
If this file is present, the node cannot be started as the Master node, and you should see a message like the one below:
ClusterController: Trying to connect to a master, as the file clustered.txt exists.
Explanation:
This means that another cluster node was still running after this node was stopped, so that node has the latest revisions and content. Thus, whenever you start up your cluster nodes, the node that was stopped last in the cluster should become the Master node.
To reiterate, clustered.txt should never be deleted in order to start up an instance. Delete it only if you want to start the instance as a standalone instance that is no longer part of the cluster. A sketch for checking this marker across nodes is shown below.
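Since the absence of clustered.txt identifies the last running Master, a small script can make the start order explicit before you bring the cluster back up. This is a minimal sketch, assuming hypothetical node names and repository paths (clustered.txt typically sits in the repository directory of the instance; adjust the paths for your install):

    # Sketch: decide the start order before bringing the cluster back up by
    # checking each node's repository for the clustered.txt marker. Node names
    # and repository paths are hypothetical; run the check on each server or
    # adapt it to fetch the information remotely.
    from pathlib import Path

    NODE_REPOS = {
        "node-a": Path("/opt/cq/node-a/crx-quickstart/repository"),
        "node-b": Path("/opt/cq/node-b/crx-quickstart/repository"),
    }

    for name, repo in NODE_REPOS.items():
        if (repo / "clustered.txt").exists():
            # Marker present: this node must join an existing Master as a Slave.
            print(f"{name}: clustered.txt present -> start as Slave; do NOT delete it")
        else:
            # No marker: this node stopped last and holds the newest revisions.
            print(f"{name}: no clustered.txt -> last running Master; start it first")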
Solution:
a) Stop the Slave node that threw this Out of Sync exception.
b) Stop the currently running Master node (the one that was started by deleting the clustered.txt file).
c) Start the Slave node (in this case it is effectively the last running Master, since it has the latest revisions). It will take over as the new Master.
d) Start the old Master node, which will then join the cluster as a Slave.
e) Once the sync completes, stop and start the currently running Master node (the old Slave) to restore role consistency, so that the Master behaves as Master and the Slave behaves as Slave. A log-scan sketch for verifying these transitions follows.
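To verify where each node stands during this procedure, you can scan its error log for the cluster messages quoted in this article. The sketch below is illustrative only; the log path is an assumption, so point it at the instance's actual error log.

    # Sketch: scan a node's error log for the cluster messages quoted in this
    # article, to see whether the node is joining as a Slave or still reports
    # the out-of-sync error. The log path is an assumption for your install.
    from pathlib import Path

    LOG = Path("/opt/cq/crx-quickstart/logs/error.log")  # assumed log location

    MARKERS = {
        "joining as Slave": "Trying to connect to a master, as the file clustered.txt exists",
        "still out of sync": "This cluster node and the master are out of sync",
    }

    text = LOG.read_text(errors="replace")
    for label, needle in MARKERS.items():
        print(f"{label}: {'FOUND' if needle in text else 'not found'}")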
a) For more information on the Out of Sync situation, please refer to our documentation link
b) For more information on the procedure to clone the Slave, please refer to our documentation link
Keep in mind when Cloning the Master to create the Slave
If an ungraceful shutdown, hard reboot, network issue, power failure, or similar event on any of the cluster nodes leaves the nodes out of sync with each other, then as a recovery procedure you need to either restore from an old backup or re-create the cluster node.
In such scenarios, you need to create a clone of a cluster node and join it to the cluster as a fresh Slave. First identify which node was the last running Master, as that node has the latest content and revisions. Then perform the backup only on that node and use it to create the Slave.
If you clone the wrong node (the old Master that was stopped first), you run a high risk of losing data present on the last running node. So when choosing which node to clone, make sure you identify the right one; a guard sketch follows below.
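As a safeguard, you can refuse to run the backup on any node whose repository still contains clustered.txt, since that node stopped as a Slave and may be missing the newest revisions. A minimal pre-backup guard, assuming a hypothetical repository path:

    # Sketch of a pre-backup guard: abort if this node's repository contains
    # clustered.txt, because that means it stopped as a Slave and the last
    # running Master should be cloned instead. The path is an assumption.
    import sys
    from pathlib import Path

    REPO = Path("/opt/cq/crx-quickstart/repository")  # assumed repository path

    if (REPO / "clustered.txt").exists():
        sys.exit("clustered.txt found: this node stopped as a Slave. "
                 "Clone the last running Master instead.")
    print("No clustered.txt marker: this node looks like the last running Master.")
    # ...proceed with the cold backup / clone procedure here...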
This article applies to all CQ 5.x versions running CRX 2.x only (versions greater than 2.2.0.36).