Issue

Replication agent queue items on the author instance are piling up after the publish instances crashed.  Only restarting the author instance clears the queues.

Thread dumps show the replication queue's thread stuck in socketRead state:

"pool-6-thread-68-com_day_cq_replication_job_publish1(com/day/cq/replication/job/publish1)" daemon prio=10 tid=0x00007ff0c41b1800 nid=0x2e7b runnable [0x00007ff05923f000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
- locked <0x00000006e0ba67b0> (a java.io.BufferedInputStream)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at com.day.cq.replication.impl.transport.Http.deliver(Http.java:510)
at com.day.cq.replication.impl.transport.Http.deliver(Http.java:170)
at com.day.cq.replication.impl.AgentImpl.doReplicate(AgentImpl.java:474)
- locked <0x000000069235a868> (a com.day.cq.replication.impl.AgentImpl)
at com.day.cq.replication.impl.AgentImpl.process(AgentImpl.java:371)
at com.day.cq.replication.impl.queue.ReplicationQueueImpl.process(ReplicationQueueImpl.java:285)
at com.day.cq.replication.impl.AgentManagerImpl.process(AgentManagerImpl.java:409)
at org.apache.sling.event.impl.jobs.queues.AbstractJobQueue$2.run(AbstractJobQueue.java:666)
- locked <0x00000006e0c9a080> (a java.lang.Object)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Cause

TCP/IP sockets for replication on the author instance were stuck in socketRead state waiting forever on the publish instance since the instance had crashed.

Resolution

To prevent this issue in the future, set timeouts on the replication network connections. 

To set the timeouts, follow these steps: 

  1. Go to all your replication agents via http://aem-host:port/etc/replication/agents.author.html.

  2. Open each active agent's page.

  3. Click Edit.

  4. Select the Extended tab.

  5. Set Connect Timeout to 10000.

  6. Set Socket Timeout to 300000.

  7. Click Ok to save.

Questo prodotto è concesso in licenza in base alla licenza di Attribuzione-Non commerciale-Condividi allo stesso modo 3.0 Unported di Creative Commons.  I post su Twitter™ e Facebook non sono coperti dai termini di Creative Commons.

Note legali   |   Informativa sulla privacy online