Best Practices for Avoiding Production Outages

Common Issues

Author/Publish instance is very slow or shows high CPU usage

High memory usage on AEM instances

  • Check the memory usage at [1]
  • Generate heap dumps by following the article at [2] and share them with AEM Support for further analysis

[1] http://<host>:<port>/system/console/memoryusage
[2] https://helpx.adobe.com/experience-manager/kb/AnalyzeMemoryProblems.html
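The article at [2] remains the authoritative procedure. As a rough illustration only, the sketch below wraps the JDK's jmap tool to write a binary heap dump for a running AEM JVM. It assumes jmap is on the PATH and is run as the same OS user as the AEM process; the function names, the output directory and the file naming are hypothetical. Taking a heap dump pauses the JVM, so plan for that on a production instance.

```python
import subprocess
from pathlib import Path


def heap_dump_command(pid, out_file):
    """Build a jmap invocation that writes a binary heap dump (.hprof)."""
    return ["jmap", f"-dump:format=b,file={out_file}", str(pid)]


def take_heap_dump(pid, out_dir="/tmp"):
    """Run jmap against the given JVM process and return the dump path.

    The output directory and file name are illustrative assumptions.
    """
    out_file = Path(out_dir) / f"aem-heap-{pid}.hprof"
    # check=True raises CalledProcessError if jmap exits with an error.
    subprocess.run(heap_dump_command(pid, str(out_file)), check=True)
    return out_file
```

For example, take_heap_dump(12345) would try to write /tmp/aem-heap-12345.hprof, where 12345 stands in for the PID of the AEM Java process.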

High CPU usage after dispatcher cache clear

  • You can control cache invalidation with the "/invalidate" and "/statfileslevel" properties:
    • If /invalidate denies everything and /statfileslevel is not set, only the activated pages are deleted from the cache
    • If /invalidate allows everything and /statfileslevel is set, only pages in the folder whose stat file was updated are invalidated
    • If /invalidate allows everything and /statfileslevel is not set, all pages under the docroot are invalidated, wherever they are located
  • After code deployments, try to recache the pages. Immediate recaching ensures that Dispatcher retrieves and caches the page only once, instead of once for each of the simultaneous client requests.
  • Refer to the Optimizing Dispatcher Cache article for more in-depth insights.
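The recaching step after a deployment can be sketched as a small warm-up script that requests each important page exactly once, one after another, so that Dispatcher fetches and caches it before real traffic arrives. This is a minimal sketch, not an Adobe tool: the function name, the base URL and the page paths are assumptions, and the list of pages to warm has to come from your own site.

```python
import urllib.request


def warm_cache(base_url, paths, fetch=None):
    """Request each path once, sequentially, and return {path: result}.

    `fetch` is an optional callable(url) -> status, which makes it easy
    to substitute another HTTP client or a stub.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.status

    results = {}
    for path in paths:
        url = base_url.rstrip("/") + "/" + path.lstrip("/")
        try:
            results[path] = fetch(url)
        except Exception as exc:
            # Record the failure and continue with the remaining pages.
            results[path] = repr(exc)
    return results
```

A call such as warm_cache("https://www.example.com", ["/content/site/en.html"]) would be run once after the cache has been cleared; both values here are placeholders.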

SegmentNotFound exceptions observed in the logs

  • Follow the steps in the Resolving Segmentnotfound article
  • If no good revision is found, try to find the corrupted nodes using the script mentioned in Part B of the above article.
  • If corruption is found under any folder other than /apps, contact the AEM Support team for further guidance.
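Before working through the article, it can help to see how often the exception occurs and which segments are reported. The sketch below scans log lines for SegmentNotFoundException and counts the UUID-shaped segment IDs found on those lines. It assumes the segment ID appears on the same line as the exception name; the function name is hypothetical, and the script only summarizes the log, it does not repair anything.

```python
import re
from collections import Counter

# UUID-shaped token, the form in which segment IDs are assumed to appear.
SEGMENT_ID = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def missing_segments(lines):
    """Count segment IDs on lines that mention SegmentNotFoundException."""
    counts = Counter()
    for line in lines:
        if "SegmentNotFoundException" not in line:
            continue
        for seg in SEGMENT_ID.findall(line):
            counts[seg.lower()] += 1
    return counts
```

It can be fed an open log file directly, for example missing_segments(open("error.log", errors="replace")), where the file name depends on your logging setup.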

RCA for an AEM outage that was resolved by a restart

Share the following data with the AEM Support team so they can perform the root cause analysis (RCA):

  • Log files during the outage
  • Thread dumps taken during the outage
  • Heap dumps taken during the outage, if available
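Thread dumps are only useful if they are captured while the problem is happening, so it helps to have a collection script ready before the restart. The sketch below takes a series of dumps with the JDK's jstack tool at a fixed interval and writes each one to its own file. It assumes jstack is on the PATH and is run as the same OS user as the AEM process; the function name, the default count and interval, and the file naming are hypothetical and should be adjusted to whatever AEM Support asks for.

```python
import subprocess
import time
from pathlib import Path


def collect_thread_dumps(pid, out_dir, count=5, interval=10.0,
                         run=None, sleep=time.sleep):
    """Take `count` thread dumps of process `pid`, `interval` seconds apart.

    Returns the list of files written. `run` and `sleep` can be replaced,
    for example to use a different dump tool.
    """
    if run is None:
        def run(cmd):
            done = subprocess.run(cmd, check=True,
                                  capture_output=True, text=True)
            return done.stdout

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        dump = run(["jstack", "-l", str(pid)])
        target = out / f"threaddump-{pid}-{i + 1}.txt"
        target.write_text(dump)
        written.append(target)
        if i < count - 1:
            sleep(interval)  # no wait after the last dump
    return written
```

For example, collect_thread_dumps(12345, "/tmp/aem-dumps") would write five files into /tmp/aem-dumps; the PID and directory are placeholders.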

Session leak in AEM

Check whether JCR sessions are leaking in your AEM instance and analyze any leaks you find.

Detailed Guide on Troubleshooting Critical Issues