Garbage Collection causing cluster to fail with <Error accessing the cache container - Error on PUT action for cache Local:SERVICE_FACTORY_CACHE> | AEM Forms

Country

United Kingdom

Issue

Garbage Collection on JVM instances configured with a large Heap Size can stop the system and cause the cluster node to get forced out of the cluster. The following exception displays in the LOG:

<Feb 24, 2017 1:47:07 AM CET> <Error> <com.adobe.idp.dsc.invocation.impl.ServiceFactoryCache> <BEA-000000> <Error accessing the cache container - Error on PUT action for cache Local:SERVICE_FACTORY_CACHE>
com.adobe.livecycle.cache.CacheActionException: Error accessing the cache container - Error on GET action for cache Replicated:SystemInfoStore
at com.adobe.livecycle.cache.adapter.GemfireCacheAdapter.put(GemfireCacheAdapter.java:352)
at com.adobe.monitor.stats.SystemCacheManager.addAndUpdateStatisticsCache(SystemCacheManager.java:45)
at com.adobe.monitor.stats.SystemInfoRetrievalThread.retrieveInfo(SystemInfoRetrievalThread.java:110)
at com.adobe.monitor.stats.SystemInfoRetrievalThread.run(SystemInfoRetrievalThread.java:70)
Caused by: com.gemstone.gemfire.distributed.DistributedSystemDisconnectedException: GemFire on weform4p(GC_gBqP:2381)<v3>:23067 started at Thu Feb 23 10:27:17 CET 2017: Message distribution has terminated, caused by com.gemstone.gemfire.ForcedDisconnectException: This member has been forced out of the distributed system. Reason='did not respond to are-you-dead messages'
at com.gemstone.gemfire.distributed.internal.DistributionManager$Stopper.generateCancelledException(DistributionManager.java:782)
at com.gemstone.gemfire.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:942)
at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1060)

Environment

AEM Forms
LiveCycle ES4 SP1

Cause

AEM Forms and LiveCycle use Gemfire as a caching mechanism. To remain in the cluster, each node responds to heartbeat requests from the member coordinator within 5000 ms (Default value). If the member does not respond after two Heartbeat requests, the cluster node is deemed faulty and is being forced out of the cluster.

If you have a JVM instance running with a large Heap Size, Full Garbage Collection (GC) can cause the whole system to pause for few seconds until the GC completes. There is a possibility that you notice error related to high memory usage from the Work Manager, or high CPU usage from the Scheduler indicating missing triggers. Full GC prevents Gemfire from responding to heartbeats and causes the Cluster node to get forced out if it takes more than 2 x 5000 ms.

Resolution

First step is to confirm that it is the behavior you are experiencing by analyzing GC statistics. Generate GC statistics by adding the following JVM parameters (The following example creates a log on the C: drive):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=30M -Xloggc:<path to a GC log file>

Request the GC logs when the issue occurs again.

Look at the time when the issue occurs. It displays the following lines in the LOG:

2017-03-02T01:30:41.415+0100: 107053.338: [GC [PSYoungGen: 4957696K->143858K(3736064K)] 6633519K->3611125K(8978944K), 13.7360350 secs] [Times: user=11.54 sys=1.88, real=13.73 secs]
2017-03-02T01:36:29.568+0100: 107401.492: [GC [PSYoungGen: 3736050K->825332K(4417536K)] 7203317K->4928257K(9660416K), 10.2544810 secs] [Times: user=13.20 sys=1.38, real=10.25 secs]

It indicates that the Full GC took 13.73 s and 10.28 s, which are above the time taken for the two Heartbeats. It causes the cluster node to get forced out.

There are few options to resolve the issue:

  1. Optimize GC collection on your system to make sure to complete within the Gemfire timeout. For the guidelines, refer to https://www.cubrid.org/blog/how-to-tune-java-garbage-collection.
  2. Generate a Heap Dump when Full GC is executed and analyze which objects are being heavily used. Use it as a basis to optimize your workflow. 
  3. Increase the Gemfire Membership Timeout to a value high enough to allow the Full GC to complete. For example, the JVM argument below increases the timeout to 15 s which covers the situation found in the GC logs above. For more information, refer to https://pubs.vmware.com/vfabric5/index.jsp?topic=/com.vmware.vfabric.gemfire.6.6/reference/topics/gemfire_properties.html:
“-Dgemfire.member-timeout=15000”