Objective
The following information applies to Oak 1.6+ and AEM 6.3+.
Problem
Due to the limited available memory (on and off heap), when the instance's repository grows beyond a certain size, the caches can no longer hold the frequently accessed content.
As a result, most repository accesses read data directly from disk, which is much slower and leads to a poor end-user experience.
Symptoms
Overall, the instance becomes slow: response times increase and Online Revision Cleanup takes much longer to complete, sometimes overrunning the allocated maintenance window.
At the system level, constant high IO activity is observed.
Steps
Troubleshooting
There are multiple endpoints that can be monitored to determine whether the system is IO bound.
The following paragraphs discuss the available endpoints and the main indicators.
Monitoring endpoints
There are various endpoints for monitoring IO-related metrics in AEM, the JVM, and the OS.
Together they provide different perspectives on the overall throughput in the system at its various layers: from JCR sessions, to commits in the TarMK, to disk IO of the TarMK.
Combined with information collected through JVM and OS level tooling, they provide a wealth of information about the system's health and help locate bottlenecks; a short sketch showing how to reach these MBeans over JMX follows the list below.
- Each session exposes a SessionMBean instance, which contains counters like the number and rate of reads and writes to the session.
- The RepositoryStatsMBean exposes endpoints to monitor the number of open sessions, the session login rate, the overall read and write load across all sessions, the overall read and write timings across all sessions, and the overall load and timings for queries and observation.
- The SegmentNodeStoreStatsMBean exposes endpoints to monitor commits: number and rate, number of queued commits and queuing times.
- The FileStoreStatsMBean exposes endpoints reflecting the amount of data written to disk, the number of tar files on disk and the total footprint on disk.
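A minimal sketch of reaching those MBeans programmatically, using only the standard JMX API: it lists everything registered under the org.apache.jackrabbit.oak domain from inside the same JVM (the same beans the Felix console shows under /system/console/jmx). The class name is illustrative, and the domain pattern assumes the default Oak registration.

    import java.lang.management.ManagementFactory;
    import java.util.Set;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class OakMBeanList {
        public static void main(String[] args) throws Exception {
            // The platform MBeanServer of the JVM this code runs in.
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // All MBeans in the Oak domain: session statistics, repository stats,
            // segment node store stats, file store stats, metrics, and so on.
            Set<ObjectName> names =
                    server.queryNames(new ObjectName("org.apache.jackrabbit.oak:*"), null);
            for (ObjectName name : names) {
                System.out.println(name);
            }
        }
    }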
In addition to those endpoints, there are many JVM and OS-specific tools that help gain further insight into what the system is busy with (a programmatic counterpart using the platform MXBeans is sketched after the list):
- Java Mission Control (jmc) is a powerful tool to collect every performance aspect of a running JVM. Its ability to record IO per Java process can sometimes be invaluable.
- The command line tools jstat, jstack, and jmap are useful for inspecting the JVM's garbage collector, threads, and heap, respectively.
- The OS level tools vmstat and iostat are used to examine IO and CPU usage at the operating system level.
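As a rough in-process counterpart to jstat, jstack, and jmap, the standard platform MXBeans can be polled from a small utility when shell access to the server is limited. This is only a sketch (the class name is illustrative) and does not replace the command line tools:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.ThreadMXBean;

    public class JvmSnapshot {
        public static void main(String[] args) {
            // Garbage collection counters, roughly what jstat -gcutil summarizes.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("GC %s: collections=%d, time=%d ms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            // Live thread counts; jstack provides the full stack traces.
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            System.out.printf("Threads: live=%d, peak=%d%n",
                    threads.getThreadCount(), threads.getPeakThreadCount());
            // Heap usage; jmap provides the detailed histograms and dumps.
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            System.out.printf("Heap used: %d bytes%n",
                    memory.getHeapMemoryUsage().getUsed());
        }
    }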
Monitoring Disk IO
- What: the number of disk operations (reads/writes) per time unit (second).
- How: OS level tooling (for example: vmstat, iostat on UNIX).
- Normal: low level of disk reads (close to zero); constant, low number of writes (see image). Peaks during revision cleanup.
- Warning: a high and growing level of disk reads is a sign of memory undersizing (see image).
- DISCLAIMER: a high volume of disk IO may be caused by other operations running on AEM (for example: assets ingestion) or by another process (for example: a virus scan), so make sure to exclude any other cause before diagnosing Segment Tar as the culprit. Generally, the trend over days is more relevant than local peaks. The absolute values are not relevant here and can vary depending on the instance size, the traffic, and the underlying hardware. One way to correlate disk writes with repository activity is sketched after this list.
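One way to correlate what iostat reports with what the repository itself is writing is to sample the file store statistics over JMX and watch how fast the footprint grows. This sketch assumes remote JMX has been enabled on the AEM JVM (for example with -Dcom.sun.management.jmxremote.port=9010) and uses an assumed object name and attribute; verify the exact FileStoreStats MBean name and its attributes in /system/console/jmx before relying on it.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class FileStoreGrowthProbe {
        public static void main(String[] args) throws Exception {
            // Illustrative host/port; requires remote JMX to be enabled on the AEM JVM.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection connection = connector.getMBeanServerConnection();
                // Assumed object name and attribute; confirm both in /system/console/jmx.
                ObjectName fileStore = new ObjectName(
                        "org.apache.jackrabbit.oak:name=FileStore statistics,type=FileStoreStats");
                long before = (Long) connection.getAttribute(fileStore, "ApproximateSize");
                Thread.sleep(60_000);
                long after = (Long) connection.getAttribute(fileStore, "ApproximateSize");
                System.out.printf("Repository grew by %d bytes in one minute%n", after - before);
            }
        }
    }

If the repository footprint grows slowly while iostat shows heavy writes, the IO is likely coming from something other than the TarMK.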
Monitoring CPU
- What: time spent by the CPU on various operations, especially waiting for IO.
- How: OS level tooling (for example: vmstat on UNIX). Not all the tools report this (for example: top).
- Normal: CPU is mostly used by the application at user level and the waiting for IO is a small percentage.
- Warning: the CPU is spending most of its cycles waiting for IO, with an increasing trend, to the detriment of the user application.
- DISCLAIMER: a high percentage of CPU waiting for IO may be caused by other operations running on AEM (for example: assets ingestion) or by another process (for example: a virus scan), so make sure to exclude any other cause before diagnosing Segment Tar as the culprit. Generally, the trend over days is more relevant than local peaks. A small sketch for separating the AEM process's CPU usage from the rest of the system follows this list.
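The IO-wait percentage itself is only visible to OS tools such as vmstat, but comparing the AEM process's own CPU usage with the overall system load helps rule the instance in or out when the machine looks busy. A minimal sketch, assuming a HotSpot JVM (it relies on the com.sun.management variant of OperatingSystemMXBean; the class name is illustrative):

    import com.sun.management.OperatingSystemMXBean;
    import java.lang.management.ManagementFactory;

    public class CpuLoadProbe {
        public static void main(String[] args) throws InterruptedException {
            // HotSpot-specific MXBean exposing process- and system-wide CPU load.
            OperatingSystemMXBean os = (OperatingSystemMXBean)
                    ManagementFactory.getOperatingSystemMXBean();
            for (int i = 0; i < 12; i++) {
                System.out.printf("process=%.1f%%  system=%.1f%%%n",
                        os.getProcessCpuLoad() * 100, os.getSystemCpuLoad() * 100);
                Thread.sleep(5_000);
            }
        }
    }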
Monitoring the Commit Queue
- What: the commit queue is a buffering mechanism used by Segment Tar when the incoming volume of commits is higher than the processing speed.
- How: the real-time commit queue size is exposed via the JMX MBean org.apache.jackrabbit.oak:name=COMMIT_QUEUE_SIZE,type=Metrics (/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3DCOMMIT_QUEUE_SIZE%2Ctype%3DMetrics). It can be accessed over HTTP in the system console or by using a JMX client (see the sketch at the end of this list).
- Normal: an empty queue (size=0) shows that the system is in a healthy state and can process all the commits at the speed they come in. Local peaks that get processed fast are also normal.
- Warning: a constantly nonzero queue with an increasing trend means Segment Tar cannot process commits at the incoming rate. There is a risk that the queue fills up and new commits are rejected.
- DISCLAIMER: a temporarily high queue (peak) may be caused by an intensive operation being triggered (for example: replication or rollout) or by high traffic, so make sure to exclude any other cause before diagnosing Segment Tar as the culprit. Generally, the trend is more relevant than local peaks.
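The same metric can also be read outside the system console. The sketch below reuses the in-process pattern from the earlier sketches and polls the value a few times so that the trend, rather than a single peak, is what gets evaluated; the attribute name Count is an assumption for counter-type metrics, so confirm it in /system/console/jmx (from a separate JVM, connect through JMXConnectorFactory instead, as in the file store sketch above).

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class CommitQueueProbe {
        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Object name as documented above; the attribute name is an assumption.
            ObjectName queueSize = new ObjectName(
                    "org.apache.jackrabbit.oak:name=COMMIT_QUEUE_SIZE,type=Metrics");
            for (int i = 0; i < 6; i++) {
                System.out.printf("commit queue size: %s%n",
                        server.getAttribute(queueSize, "Count"));
                Thread.sleep(10_000);
            }
        }
    }

A value that stays at zero, or spikes and quickly drains, is healthy; a value that keeps climbing across samples matches the warning condition described above.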