There are various reasons for replication to fail. This article explains the approach one might take when analyzing these issues.
Are replications getting triggered at all when clicking the Activate button? If NOT then do the following:
- Go to /crx/explorer (CQ5.5) and login as admin.
- Open "Content Explorer"
- See if a node /bin/replicate or /bin/replicate.json exists. If the node exists then delete it and save.
Are the replications getting queued up in the replication agent queues?
Check this by going to /etc/replication/agents.author.html then click on the replication agents to check.
If one agent queue or a few agent queues are stuck:
- Does the queue show blocked status? If so then is the publish instance not running or totally unresponsive? Check the publish instance to see what is wrong with it (i.e. check the logs, and see if there is an OutOfMemory error or some other issue. Then if it is just generally slow then take thread dumps and analyze them.
- Does the queue status show Queue is active - # pending? Basically the replication job could be stuck in a socket read waiting for the pubilsh instance or dispatcher to respond. This could mean that the publish instance or dispatcher is under high load or stuck in a lock. Take thread dumps from author and publish in this case.
- Open the thread dumps from author in a thread dump analyzer, check if it shows that the replication agent's sling eventing job is stuck in a socketRead.
- Open the thread dumps from publish in a thread dump analyzer, analyze what might be causing the publish instance not to respond. You should see a thread with POST /bin/receive in its name, that is the thread receiving the replication from author.
If all agent queues are stuck
- It is possible that a certain piece of content cannot be serialized under /var/replication/data due to repository corruption or some other issue. Check the logs/error.log for a related error. To clear out the bad replication item, do the following:
- Go to http://<host>:<port>/crx and login as admin user. In CQ5.5 go to http://<host>:<port>/crx/explorer instead.
- Click on "Content Explorer".
- In the "Content Explorer" window click on the magnifying glass button on the top right of the window and a search dialog will pop up.
- Select the "XPath" radio button.
- In the "Query" box enter this query /jcr:root/var/eventing/jobs//element(*,slingevent:Job) order by @slingevent:created
- Click "Search"
- In the results the top items are the latest sling eventing jobs. Click on each one and find the stuck replications that match what shows up in the top of the queue.
- There might be something wrong with sling eventing framework job queues. Try restarting the org.apache.sling.event bundle in the/system/console.
- It might be that job processing is completely turned off. You can check that under Felix Console in the Sling Eventing Tab. Check if it displays - Apache Sling Eventing (JOB PROCESSING IS DISABLED!)
- If yes, then check Apache Sling Job Event Handler under Configuration tab in Felix Console. Might be that 'Job processing Enabled' checkbox is unchecked. If that is checked and still it displays that 'job processing is disabled', then check if there is any overlay under /apps/system/config that is disabling the job processing. Try creating an osgi:config node for jobmanager.enabled with a boolean value to true and re-check if the activation started and there are no more jobs in queue.
- It might also be the case that DefaultJobManager configuration gets into an inconsistent state. This can happen when someone manually modifies the 'Apache Sling Job Event Handler' configuration via the OSGiconsole (For example disable and re-enable the 'Job Processing Enabled' property and Save the configuration).
- At this point the DefaultJobManager configuration which is stored at crx-quickstart/launchpad/config/org/apache/sling/event/impl/jobs/DefaultJobManager.config gets into an inconsistent state. And even though the 'Apache Sling Job Event Handler' property shows 'Job Processing Enabled' to be in checked state, when one navigates to the Sling Eventing tab, it shows the message - JOB PROCESSING IS DISABLED and the replication does not work.
- To resolve this issue, one would need to navigate to the Configuration page of the OSGi console and delete the 'Apache Sling Job Event Handler' configuration. Then restart the Master node of the cluster to get the configuration back into a consistent state. This should fix the issue and replication will start working again.
Create a replication.log
Sometimes it can be very helpful to set all replication logging to be added in a separate log file at DEBUG level. To do this:
- Go to http://host:port/system/console/configMgr and login as admin.
- Find the Apache Sling Logging Logger factory and create an instance by clicking the + button on the right of the factory configuration. This will create a new logging logger.
- Set the configuration like this:
- Log Level: DEBUG
- Log File Path: (CQ5.4 and 5.3) ../logs/replication.log (CQ5.5) logs/replication.log
- Categories: com.day.cq.replication
- If you suspect the problem to be related to sling eventing/jobs in any way then you can also add this java package under categories:org.apache.sling.event
Sometime it might be suitable to pause the replication queue to reduce load on the author system, without disabling it. Currently this is only possible by a hack of temporarily configuring an invalid port. From 5.4 onwards you could see pause button in replication agent queue it has some limitation
- The state is not persisted that means if you restart a server or replication bundle is recycled it gets back to running state.
- The pause is idle for a shorter period (OOB 1 hour after no activities with replication by other threads) and not for a longer time. Because There is feature in sling which avoid idle threads. Basically check if a job queue thread has been unused for a longer time, if so it kicks up clean up cycles. Due to cleanup cycle it stops the thread and hence the paused setting is lost. Since jobs are persisted it initiates a new thread to process the queue which does not have details of the paused configuration. Due to this queue turns into running state.
Page permissions are not replicated because they are stored under the nodes to which access is granted, not with the user.
In general page permissions should not be replicated from the author to publish and are not by default. This is because access rights should be different in those two environments. Therefore it is recommended to configure ACLs on publish separately from author.