Issue

Replication agent queue items on the author instance pile up after the publish instances crash. The queues clear only when the author instance is restarted.

Thread dumps show the replication queue's thread stuck in socketRead state:

"pool-6-thread-68-com_day_cq_replication_job_publish1(com/day/cq/replication/job/publish1)" daemon prio=10 tid=0x00007ff0c41b1800 nid=0x2e7b runnable [0x00007ff05923f000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
- locked <0x00000006e0ba67b0> (a java.io.BufferedInputStream)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at com.day.cq.replication.impl.transport.Http.deliver(Http.java:510)
at com.day.cq.replication.impl.transport.Http.deliver(Http.java:170)
at com.day.cq.replication.impl.AgentImpl.doReplicate(AgentImpl.java:474)
- locked <0x000000069235a868> (a com.day.cq.replication.impl.AgentImpl)
at com.day.cq.replication.impl.AgentImpl.process(AgentImpl.java:371)
at com.day.cq.replication.impl.queue.ReplicationQueueImpl.process(ReplicationQueueImpl.java:285)
at com.day.cq.replication.impl.AgentManagerImpl.process(AgentManagerImpl.java:409)
at org.apache.sling.event.impl.jobs.queues.AbstractJobQueue$2.run(AbstractJobQueue.java:666)
- locked <0x00000006e0c9a080> (a java.lang.Object)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
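When several thread dumps need checking, the search for this pattern can be scripted. The following is a minimal sketch, not an Adobe tool: the class name `StuckReplicationThreads` and its `find` method are hypothetical, and it assumes the dump uses the same layout as the trace above (a quoted thread name followed by its stack frames). It reports threads whose name contains `com_day_cq_replication_job` and whose stack includes `SocketInputStream.socketRead0`. A single dump only shows that a thread is reading from a socket at that moment; the same thread appearing in several dumps taken minutes apart is what suggests it is stuck.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class StuckReplicationThreads {

    /** Returns the names of replication job threads whose stack contains socketRead0. */
    static List<String> find(String threadDump) {
        List<String> stuck = new ArrayList<>();
        String current = null; // name of the replication thread being scanned, if any
        for (String line : threadDump.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"")) {
                // A quoted name starts a new thread entry.
                int end = trimmed.indexOf('"', 1);
                String name = end > 0 ? trimmed.substring(1, end) : trimmed.substring(1);
                current = name.contains("com_day_cq_replication_job") ? name : null;
            } else if (current != null && trimmed.contains("SocketInputStream.socketRead0")) {
                stuck.add(current);
                current = null; // report each thread at most once
            }
        }
        return stuck;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StuckReplicationThreads <thread-dump-file>
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        find(dump).forEach(System.out::println);
    }
}
```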

Cause

The replication threads on the author instance were blocked in a socket read, waiting for an HTTP response from a publish instance that had crashed. Because no socket timeout was set on the connection, the read never returned and the queue stayed blocked.

Resolution

To prevent this issue in the future, set timeouts on the replication network connections. 

To set the timeouts, follow these steps: 

  1. On the author instance, open the list of replication agents at http://aem-host:port/etc/replication/agents.author.html.

  2. Open each active agent's page.

  3. Click Edit.

  4. Select the Extended tab.

  5. Set Connect Timeout to 10000 (milliseconds, that is, 10 seconds).

  6. Set Socket Timeout to 300000 (milliseconds, that is, 5 minutes).

  7. Click OK to save.
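The effect of the two settings can be illustrated outside AEM with plain `java.net` sockets. This is a minimal sketch under stated assumptions, not the replication agent's own code: the class name `ReplicationTimeoutSketch` and the method `readTimesOut` are hypothetical, and a local `ServerSocket` that never answers stands in for a hung publish instance. The connect timeout bounds how long the connection attempt may take, and the socket timeout bounds how long a read may wait for data. With a socket timeout set, the read fails with `SocketTimeoutException` instead of blocking indefinitely as in the thread dump above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReplicationTimeoutSketch {

    /** Returns true if the read gave up with a timeout instead of blocking forever. */
    static boolean readTimesOut(int connectTimeoutMs, int socketTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The server socket accepts the TCP connection in its backlog but never
        // sends a byte, which mimics a publish instance that has stopped responding.
        try (ServerSocket silentPublish = new ServerSocket(0, 1, loopback);
             Socket author = new Socket()) {
            // Connect timeout: upper bound on establishing the connection.
            author.connect(new InetSocketAddress(loopback, silentPublish.getLocalPort()),
                    connectTimeoutMs);
            // Socket timeout: upper bound on each blocking read.
            author.setSoTimeout(socketTimeoutMs);
            InputStream in = author.getInputStream();
            try {
                in.read(); // would block indefinitely with a socket timeout of 0
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Short values keep the demo quick; the steps above use 10000 and 300000.
        System.out.println("read timed out: " + readTimesOut(1000, 500));
    }
}
```

A socket timeout of 0 means no timeout in `java.net.Socket`, which matches the behaviour seen in the thread dump: the read waits until the peer sends data or the connection is closed.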
