* The following information is applicable for Oak version 1.6+ and AEM 6.3+.
Due to the limited available memory (on and off heap), the caches become overloaded once the instance's repository grows beyond a certain size.
As a result, most of the repository accesses read data directly from disk, which is much slower, resulting in a bad end-user experience.
Overall, the instance becomes slow: response times increase and Online Revision Cleanup takes much longer to complete, sometimes exceeding the allocated maintenance window.
At system level, constantly high IO activity is observed.
Multiple endpoints can be monitored to determine when the system becomes IO bound.
The following paragraphs discuss the available endpoints and the main indicators.
There are various endpoints for monitoring IO-related metrics in AEM, the JVM, and the OS.
Together they provide different perspectives on the overall throughput in the system at the various layers: from JCR sessions to commit in the TarMK to disk IO of the TarMK.
Combined with information collected with JVM and OS level tooling, they provide a wealth of information about the system's health and help find bottlenecks.
In addition to those endpoints, there are many JVM and OS-specific tools that help gain further insight into what the system is busy with:
What: the number of disk operations (reads/writes) per second
How: OS level tooling (for example: vmstat, iostat on UNIX)
Normal: low level of disk reads (close to zero); constant, low number of writes (see image). Peaks during revision cleanup.
Warning: high and growing level of disk reads is a sign of memory undersizing (see image).
DISCLAIMER: a high volume of disk IO can also be caused by other operations running on AEM (for example: assets ingestion) or by another process (for example: virus scan), so make sure to exclude any other cause before diagnosing Segment Tar as the culprit. Generally, the trend over days is more relevant than local peaks.
The absolute values are not relevant here and can vary depending on the instance size, the traffic, and the underlying hardware.
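As an illustration of the disk IO indicator described above, the following minimal Python sketch derives reads and writes per second from two snapshots of the Linux `/proc/diskstats` counters, and applies a simple trend check to a series of daily averages. It is not part of AEM or Oak. The function names (`parse_diskstats`, `io_rates`, `reads_trending_up`) and the `min_growth` threshold of 1.5 are assumptions made for this example only. In practice `iostat` or `vmstat` report the same counters.

```python
def parse_diskstats(text):
    """Parse /proc/diskstats content into {device: (reads, writes)}.

    Columns 4 and 8 of each line hold the cumulative number of
    completed reads and completed writes for the device in column 3.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        counters[fields[2]] = (int(fields[3]), int(fields[7]))
    return counters


def io_rates(before, after, interval_s):
    """Return {device: (reads_per_s, writes_per_s)} between two snapshots."""
    rates = {}
    for device, (reads_after, writes_after) in after.items():
        if device not in before:
            continue  # device appeared between the two snapshots
        reads_before, writes_before = before[device]
        rates[device] = (
            (reads_after - reads_before) / interval_s,
            (writes_after - writes_before) / interval_s,
        )
    return rates


def reads_trending_up(daily_avg_reads, min_growth=1.5):
    """Hypothetical heuristic: are disk reads growing over several days?

    Compares the mean of the most recent half of the series with the
    mean of the oldest half. min_growth is an arbitrary example value,
    not an AEM recommendation; only the trend matters, not the
    absolute numbers.
    """
    if len(daily_avg_reads) < 4:
        return False  # too few days to call it a trend
    half = len(daily_avg_reads) // 2
    early = sum(daily_avg_reads[:half]) / half
    late = sum(daily_avg_reads[-half:]) / half
    if early == 0:
        return late > 0
    return late / early >= min_growth
```

To use it, read `/proc/diskstats` twice a few seconds apart, pass both parsed snapshots to `io_rates`, and record the result periodically. Feeding one averaged read rate per day into `reads_trending_up` follows the guidance that the trend over days matters more than local peaks. Other causes of disk IO still need to be ruled out first.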