Replication agents on the author instance are listed at http://aem-host:port/etc/replication/agents.author.html.
Issue
Replication agent queue items on the author instance are piling up after the publish instances crashed. Only restarting the author instance clears the queues.
Thread dumps show the replication queue's thread stuck in socketRead state:
"pool-6-thread-68-com_day_cq_replication_job_publish1(com/day/cq/replication/job/publish1)" daemon prio=10 tid=0x00007ff0c41b1800 nid=0x2e7b runnable [0x00007ff05923f000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked <0x00000006e0ba67b0> (a java.io.BufferedInputStream) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at com.day.cq.replication.impl.transport.Http.deliver(Http.java:510) at com.day.cq.replication.impl.transport.Http.deliver(Http.java:170) at com.day.cq.replication.impl.AgentImpl.doReplicate(AgentImpl.java:474) - locked <0x000000069235a868> (a com.day.cq.replication.impl.AgentImpl) at com.day.cq.replication.impl.AgentImpl.process(AgentImpl.java:371) at com.day.cq.replication.impl.queue.ReplicationQueueImpl.process(ReplicationQueueImpl.java:285) at com.day.cq.replication.impl.AgentManagerImpl.process(AgentManagerImpl.java:409) at org.apache.sling.event.impl.jobs.queues.AbstractJobQueue$2.run(AbstractJobQueue.java:666) - locked 
<0x00000006e0c9a080> (a java.lang.Object) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
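When a thread dump is large, it can help to list only the threads that are sitting in a socket read. The following is a minimal sketch, not an Adobe tool: the class name StuckThreadFinder and the method findThreadsIn are hypothetical, and the code assumes the usual jstack layout in which each thread starts with a quoted name followed by its "at ..." frames.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds threads in a thread dump whose stack contains a given frame. */
class StuckThreadFinder {

    /**
     * Returns the names of all threads whose stack mentions frameFragment.
     * Assumes each thread section begins with a line that starts with a quoted thread name.
     */
    static List<String> findThreadsIn(String dump, String frameFragment) {
        List<String> matches = new ArrayList<>();
        String current = null;     // name of the thread whose frames are being read
        boolean recorded = false;  // ensures each thread is listed at most once
        for (String line : dump.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"")) {
                // Header line: the thread name sits between the first pair of quotes.
                int end = trimmed.indexOf('"', 1);
                current = end > 0 ? trimmed.substring(1, end) : trimmed;
                recorded = false;
            } else if (current != null && !recorded
                    && trimmed.startsWith("at ") && trimmed.contains(frameFragment)) {
                matches.add(current);
                recorded = true;
            }
        }
        return matches;
    }

    /** Usage: java StuckThreadFinder path/to/threaddump.txt */
    public static void main(String[] args) throws Exception {
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        for (String name : findThreadsIn(dump, "socketRead0")) {
            System.out.println(name);
        }
    }
}
```

Replication queue threads carry the agent's job topic in their name (com/day/cq/replication/job/publish1 in the dump above), so the output indicates which agent is blocked.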
Cause
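The replication agents' TCP/IP sockets on the author instance were blocked in socketRead, waiting indefinitely for a response from a publish instance that had crashed. With no socket timeout in effect, such a read never returns, so the queue thread stays blocked and items accumulate behind it.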
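The blocking behaviour can be reproduced with plain JDK sockets. The sketch below is only an illustration of the mechanism, not AEM's replication code: SilentPeerDemo and readTimesOut are hypothetical names, and the "publish instance" is simulated by a local listener that accepts the TCP connection at the operating-system level but never sends a byte.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo: reading from a peer that never answers. */
class SilentPeerDemo {

    /**
     * Connects to a silent local listener and attempts a read.
     * Returns true if the read was abandoned with SocketTimeoutException.
     * Warning: a soTimeoutMillis of 0 means "no timeout", so the read would block forever,
     * which is the state shown in the thread dump.
     */
    static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {
            client.setSoTimeout(soTimeoutMillis);
            try {
                client.getInputStream().read(); // blocks: the peer never sends anything
                return false;
            } catch (SocketTimeoutException expected) {
                return true; // the timeout released the thread
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a positive timeout the reading thread is released after the configured interval instead of staying in socketRead indefinitely.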
Resolution
To prevent this issue in the future, set timeouts on the replication network connections.
To set the timeouts, follow these steps:
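- Go to the replication agents list on the author instance (/etc/replication/agents.author.html).
- Open each active agent's page.
- Click Edit.
- Select the Extended tab.
- Set Connect Timeout to 10000 (milliseconds, that is, 10 seconds).
- Set Socket Timeout to 300000 (milliseconds, that is, 5 minutes).
- Click OK to save.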
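To make the two settings concrete, here is a rough analogue using the JDK's HttpURLConnection. This is a sketch under stated assumptions: AEM applies these values internally through its own HTTP transport (Commons HttpClient, per the stack trace), not through this code, and the class ReplicationTimeouts, the method openWithTimeouts, and the URL in the usage line are hypothetical placeholders.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical illustration of a connect timeout versus a socket (read) timeout. */
class ReplicationTimeouts {

    // Values taken from the steps above, in milliseconds.
    static final int CONNECT_TIMEOUT_MS = 10_000;  // max wait to establish the TCP connection
    static final int SOCKET_TIMEOUT_MS = 300_000;  // max wait for data on an open connection

    /** Prepares (but does not yet open) an HTTP connection with both timeouts set. */
    static HttpURLConnection openWithTimeouts(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        conn.setReadTimeout(SOCKET_TIMEOUT_MS);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; substitute a real endpoint to experiment.
        HttpURLConnection conn = openWithTimeouts("http://publish-host:4503/bin/receive");
        System.out.println(conn.getConnectTimeout() + " / " + conn.getReadTimeout());
    }
}
```

The connect timeout bounds how long the author waits for an unreachable publish instance, while the socket timeout bounds how long it waits for a response once connected, which is the case seen in the thread dump.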