Relevant health checks for this presentation:
- Query Performance
- Query Traversal Limits
- Synchronized Clocks
- Observation Queue
- Lucene Indexes
In the System Maintenance composite check, the health checks for maintenance tasks:
- Revision Cleanup
- DataStore GC
- Continuous Revision GC
Query Performance checks the statistics collected by the RepositoryStats MBean, specifically the QueryAverage "per minute" aggregation, and returns the following statuses:
- returns the Critical status if any of the averages exceeds a configurable critical threshold (the default value is 15 milliseconds)
- returns the Warn status if any of the averages exceeds a configurable warning threshold (the default value is 10 milliseconds)
The MBean for this health check is: org.apache.sling.healthcheck:name=queriesStatus,type=HealthCheck.
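The threshold rule above can be illustrated with a minimal sketch. It is not the health check's actual implementation: the class and method names are hypothetical, and only the default thresholds (10 ms and 15 ms) come from the text.

```java
// Hypothetical illustration of the Query Performance thresholds described above.
final class QueryPerformanceSketch {
    enum Status { OK, WARN, CRITICAL }

    // Default thresholds quoted in the text, in milliseconds.
    static final double DEFAULT_WARN_MS = 10.0;
    static final double DEFAULT_CRITICAL_MS = 15.0;

    // Returns the worst status triggered by any of the per-minute averages.
    static Status classify(double warnMs, double criticalMs, double... averagesMs) {
        Status result = Status.OK;
        for (double average : averagesMs) {
            if (average > criticalMs) {
                return Status.CRITICAL; // one average over the critical threshold is enough
            }
            if (average > warnMs) {
                result = Status.WARN;   // keep scanning in case a later value is critical
            }
        }
        return result;
    }
}
```

With the defaults, averages of 3 ms and 12 ms would give Warn, and a single 16 ms average would give Critical.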
Query Traversal Limits checks the QueryEngineSettings MBean, more specifically the LimitInMemory and LimitReads attributes, and returns the following statuses:
- returns the Warn status if one of the limits is equal to or higher than Integer.MAX_VALUE
- returns the Warn status if one of the limits is lower than 10000 (the recommended setting from Oak)
- returns the Critical status if the QueryEngineSettings or any of the limits cannot be retrieved
Because query limits rarely change, this check is usually performed only once. The MBean for this health check is org.apache.sling.healthcheck:name=queryTraversalLimitsBundle,type=HealthCheck.
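As a rough sketch of the rules listed above, the following hypothetical helper classifies the two limits. The class name, the method, and the use of null to model a limit that could not be retrieved are assumptions made for illustration; only the 10000 and Integer.MAX_VALUE bounds come from the text.

```java
// Hypothetical illustration of the Query Traversal Limits rules described above.
final class TraversalLimitsSketch {
    enum Status { OK, WARN, CRITICAL }

    // Recommended lower bound quoted in the text.
    static final long RECOMMENDED_MIN = 10_000L;

    // A null argument models a limit that could not be retrieved (assumption).
    static Status classify(Long limitInMemory, Long limitReads) {
        if (limitInMemory == null || limitReads == null) {
            return Status.CRITICAL;
        }
        for (long limit : new long[] { limitInMemory, limitReads }) {
            // Too high means effectively unlimited, too low means below the recommendation.
            if (limit >= Integer.MAX_VALUE || limit < RECOMMENDED_MIN) {
                return Status.WARN;
            }
        }
        return Status.OK;
    }
}
```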
Observation Queue Length iterates over all Event Listeners and Background Observers, compares their queueSize to their maxQueueSize, and:
- returns the Critical status if queueSize exceeds maxQueueSize (that is, when events would be dropped)
- returns the Warn status if queueSize is over maxQueueSize * WARN_THRESHOLD (the default value is 0.75)
The warning threshold is configurable. The maximum length of each queue comes from separate configurations (Oak and AEM) and is not configurable from this health check. The MBean for this health check is org.apache.sling.healthcheck:name=ObservationQueueLengthHealthCheck,type=HealthCheck.
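The comparison above amounts to two checks per queue. The sketch below assumes one call per listener or observer. The class and method names are hypothetical, and the 0.75 default is the only value taken from the text.

```java
// Hypothetical illustration of the Observation Queue Length rule described above.
final class ObservationQueueSketch {
    enum Status { OK, WARN, CRITICAL }

    // Default warning threshold quoted in the text.
    static final double DEFAULT_WARN_THRESHOLD = 0.75;

    static Status classify(int queueSize, int maxQueueSize, double warnThreshold) {
        if (queueSize > maxQueueSize) {
            return Status.CRITICAL; // the point at which events would be dropped
        }
        if (queueSize > maxQueueSize * warnThreshold) {
            return Status.WARN;
        }
        return Status.OK;
    }
}
```

For a queue with a maximum of 1000 entries and the default threshold, the warning would start above 750 queued events.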
The Asynchronous Indexes check returns the following statuses:
- returns the Critical status if at least one indexer is failing
- checks lastIndexedTime for all indexers and:
  - returns the Critical status if it is more than 2 hours ago
  - returns the Warn status if it is between 45 minutes and 2 hours ago
  - returns the OK status if it is less than 45 minutes ago
- returns the OK status if none of these conditions is met
Both the Critical and Warn status thresholds are configurable. The MBean for this health check is org.apache.sling.healthcheck:name=asyncIndexHealthCheck,type=HealthCheck.
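The age-based rule can be sketched for a single indexer as follows. The class and method names are hypothetical, and the handling of the exact boundary values (exactly 45 minutes or exactly 2 hours) is an assumption, since the text does not specify it.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the Asynchronous Indexes rule described above.
final class AsyncIndexSketch {
    enum Status { OK, WARN, CRITICAL }

    // Default thresholds quoted in the text.
    static final Duration WARN_AFTER = Duration.ofMinutes(45);
    static final Duration CRITICAL_AFTER = Duration.ofHours(2);

    static Status classify(boolean failing, Instant lastIndexedTime, Instant now) {
        if (failing) {
            return Status.CRITICAL; // a failing indexer is critical regardless of age
        }
        Duration age = Duration.between(lastIndexedTime, now);
        if (age.compareTo(CRITICAL_AFTER) > 0) {
            return Status.CRITICAL;
        }
        if (age.compareTo(WARN_AFTER) > 0) {
            return Status.WARN;     // assumption: exactly 45 minutes still counts as OK
        }
        return Status.OK;
    }
}
```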
The Lucene Indexes check uses the data exposed by the Lucene Index Statistics MBean to identify large indexes and returns the following statuses:
- returns the Warn status if there is an index with more than 1 billion documents
- returns the Critical status if there is an index with more than 1.5 billion documents
The thresholds are configurable and the MBean for the health check is org.apache.sling.healthcheck:name=largeIndexHealthCheck,type=HealthCheck.
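A minimal sketch of the size check, assuming the document counts have already been read from the statistics MBean. The class and method names are hypothetical, and the two defaults are the values quoted above.

```java
// Hypothetical illustration of the large Lucene index thresholds described above.
final class LargeIndexSketch {
    enum Status { OK, WARN, CRITICAL }

    // Default thresholds quoted in the text, as document counts.
    static final long WARN_DOCS = 1_000_000_000L;
    static final long CRITICAL_DOCS = 1_500_000_000L;

    // Takes one document count per index and reports the worst result.
    static Status classify(long... docCounts) {
        Status result = Status.OK;
        for (long docs : docCounts) {
            if (docs > CRITICAL_DOCS) {
                return Status.CRITICAL;
            }
            if (docs > WARN_DOCS) {
                result = Status.WARN;
            }
        }
        return result;
    }
}
```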
System Maintenance is a composite check that returns the OK status if all maintenance tasks are running as configured:
- each maintenance task is accompanied by an associated health check
- you need to configure the Audit Log and Workflow Purge maintenance tasks, or otherwise remove them from the maintenance windows. If left unconfigured, these tasks fail on the first attempted run, and the System Maintenance check returns the Critical status.
- on AEM 6.2 and lower, the check returns the Warn status right after startup because the tasks have not run yet. This has been improved in 6.4.
The MBean for this health check is org.apache.sling.healthcheck:name=systemchecks,type=HealthCheck.
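The MBean names quoted in this section all follow the standard JMX ObjectName syntax, with the domain org.apache.sling.healthcheck and the keys name and type. The sketch below builds such a name and asks the local platform MBean server whether it is registered. The helper class is hypothetical, and the lookup only finds something when the code runs inside the same JVM as the instance. Outside it, the lookup simply returns false.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper for working with the health check MBean names listed above.
final class HealthCheckMBeanNames {
    static final String DOMAIN = "org.apache.sling.healthcheck";

    // Builds the ObjectName for a check such as "systemchecks" or "queriesStatus".
    static ObjectName forCheck(String name) {
        try {
            return new ObjectName(DOMAIN + ":name=" + name + ",type=HealthCheck");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid health check name: " + name, e);
        }
    }

    // True only if an MBean with that name is registered in this JVM.
    static boolean isRegisteredLocally(String name) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        return server.isRegistered(forCheck(name));
    }
}
```

The same ObjectName could also be passed to a remote MBeanServerConnection, which is how a JMX console would typically address these checks.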