Replication agent queue items on the author instance are piling up after the publish instances crashed. Only restarting the author instance clears the queues.
Thread dumps show the replication queue's thread stuck in socketRead state:
"pool-6-thread-68-com_day_cq_replication_job_publish1(com/day/cq/replication/job/publish1)" daemon prio=10 tid=0x00007ff0c41b1800 nid=0x2e7b runnable [0x00007ff05923f000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked <0x00000006e0ba67b0> (a java.io.BufferedInputStream) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at com.day.cq.replication.impl.transport.Http.deliver(Http.java:510) at com.day.cq.replication.impl.transport.Http.deliver(Http.java:170) at com.day.cq.replication.impl.AgentImpl.doReplicate(AgentImpl.java:474) - locked <0x000000069235a868> (a com.day.cq.replication.impl.AgentImpl) at com.day.cq.replication.impl.AgentImpl.process(AgentImpl.java:371) at com.day.cq.replication.impl.queue.ReplicationQueueImpl.process(ReplicationQueueImpl.java:285) at com.day.cq.replication.impl.AgentManagerImpl.process(AgentManagerImpl.java:409) at org.apache.sling.event.impl.jobs.queues.AbstractJobQueue$2.run(AbstractJobQueue.java:666) - locked <0x00000006e0c9a080> (a java.lang.Object) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
TCP/IP sockets for replication on the author instance were stuck in socketRead state waiting forever on the publish instance since the instance had crashed.