Replication queue stuck until AEM is restarted

Issue

Replication agent queue items on the author instance are piling up after the publish instances crashed.  Only restarting the author instance clears the queues.

Thread dumps show the replication queue's thread stuck in socketRead state:

"pool-6-thread-68-com_day_cq_replication_job_publish1(com/day/cq/replication/job/publish1)" daemon prio=10 tid=0x00007ff0c41b1800 nid=0x2e7b runnable [0x00007ff05923f000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
- locked <0x00000006e0ba67b0> (a java.io.BufferedInputStream)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at com.day.cq.replication.impl.transport.Http.deliver(Http.java:510)
at com.day.cq.replication.impl.transport.Http.deliver(Http.java:170)
at com.day.cq.replication.impl.AgentImpl.doReplicate(AgentImpl.java:474)
- locked <0x000000069235a868> (a com.day.cq.replication.impl.AgentImpl)
at com.day.cq.replication.impl.AgentImpl.process(AgentImpl.java:371)
at com.day.cq.replication.impl.queue.ReplicationQueueImpl.process(ReplicationQueueImpl.java:285)
at com.day.cq.replication.impl.AgentManagerImpl.process(AgentManagerImpl.java:409)
at org.apache.sling.event.impl.jobs.queues.AbstractJobQueue$2.run(AbstractJobQueue.java:666)
- locked <0x00000006e0c9a080> (a java.lang.Object)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Cause

TCP/IP sockets for replication on the author instance were stuck in socketRead state waiting forever on the publish instance since the instance had crashed.

Resolution

To prevent this issue in the future, set timeouts on the replication network connections. 

To set the timeouts, follow these steps: 

  1. Go to all your replication agents via http://aem-host:port/etc/replication/agents.author.html.

  2. Open each active agent's page.

  3. Click Edit.

  4. Select the Extended tab.

  5. Set Connect Timeout to 10000.

  6. Set Socket Timeout to 300000.

  7. Click Ok to save.

 Adobe

쉽고 빠르게 지원 받기

신규 사용자이신가요?

Adobe MAX 2024

Adobe MAX
크리에이티비티 컨퍼런스

10월 14~16일 마이애미 비치 및 온라인

Adobe MAX

크리에이티비티 컨퍼런스

10월 14~16일 마이애미 비치 및 온라인

Adobe MAX 2024

Adobe MAX
크리에이티비티 컨퍼런스

10월 14~16일 마이애미 비치 및 온라인

Adobe MAX

크리에이티비티 컨퍼런스

10월 14~16일 마이애미 비치 및 온라인