Offline revision cleanup breaks the repository when available disk space is low

Issue

After running Offline Revision Cleanup (aka Offline Tar Compaction) on AEM 6.3, the following exception can be found in the error.log:

05.10.2017 16:45:20.437 *ERROR* [FelixStartLevel] org.apache.jackrabbit.oak-segment-tar [org.apache.jackrabbit.oak.segment.SegmentNodeStoreService(208)] The activate method has thrown an exception (java.lang.IllegalStateException: This builder does not exist: null)
java.lang.IllegalStateException: This builder does not exist: null

The instance does not start at this point, and a repository check (the "oak-run check" tool) does not detect this corruption. Repair is not possible because the data was deleted by the failed Offline Revision Cleanup.

If logging was enabled during the Offline Revision Cleanup, look for the following message:

10:43:19.886 WARN [TarMK disk space check [/crx-quickstart/repository/segmentstore]] FileStore.java:701 Available disk space (14.6 GB) is too low, current repository size is approx. 59.4 GB

If this message is later followed by the message:

10:43:19.889 INFO [main] LoggingGCMonitor.java:45 TarMK GC #0: compaction succeeded in 2.034 min (122066 ms), after 0 cycles.

this means the error occurred: the repository is corrupted and must be restored from a backup.

If the message instead reads:

11:01:27.312 WARN  [main] LoggingGCMonitor.java:50          TarMK GC #0: compaction cancelled: Not enough disk space.

this means the error did not occur and there is no repository corruption. (Offline Revision Cleanup did not complete because there was not enough disk space.)
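The rule above can be sketched as a small log check. This is only an illustration, assuming the compaction output has been captured as a list of text lines. The function name and the returned labels are hypothetical, and the matching relies solely on the message fragments quoted in this article.

```python
# Message fragments copied from the log lines quoted in this article.
LOW_DISK = "Available disk space"
TOO_LOW = "is too low"
SUCCEEDED = "compaction succeeded"
CANCELLED = "compaction cancelled: Not enough disk space"


def classify_compaction_log(lines):
    """Classify an offline compaction log (hypothetical helper).

    Returns one of:
      "corrupted" - low-disk warning later followed by "compaction succeeded"
      "cancelled" - compaction was cancelled for lack of disk space
      "ok"        - compaction succeeded without a prior low-disk warning
      "unknown"   - none of the known messages were found
    """
    low_disk_seen = False
    for line in lines:
        if LOW_DISK in line and TOO_LOW in line:
            low_disk_seen = True
        elif CANCELLED in line:
            return "cancelled"
        elif SUCCEEDED in line:
            return "corrupted" if low_disk_seen else "ok"
    return "unknown"


if __name__ == "__main__":
    sample = [
        "10:43:19.886 WARN [TarMK disk space check] FileStore.java:701 "
        "Available disk space (14.6 GB) is too low, "
        "current repository size is approx. 59.4 GB",
        "10:43:19.889 INFO [main] LoggingGCMonitor.java:45 "
        "TarMK GC #0: compaction succeeded in 2.034 min (122066 ms), after 0 cycles.",
    ]
    print(classify_compaction_log(sample))
```

A "corrupted" result only reflects the message sequence described here; it is a prompt to restore from backup, not a substitute for inspecting the instance.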

Environment

AEM 6.3 with Oak 1.6 (and corresponding oak-run version).

Affected versions

Oak 1.6.0, Oak 1.6.1, Oak 1.6.2, Oak 1.6.3, Oak 1.6.4, Oak 1.6.5, Oak 1.6.6, Oak 1.6.7

Non-affected

Oak 1.0.x, Oak 1.2.x, Oak 1.4.x, Oak 1.7.x, Oak 1.8.x and above. Online Revision Cleanup is not affected.

Cause

Offline Revision Cleanup monitors the available disk space during its execution. It cancels itself when there is not enough disk space available. An error condition is not handled correctly by this process, causing an unrecoverable and potentially severe repository corruption.

This is tracked in OAK-7050:

Offline Revision Cleanup can corrupt the repository in some cases: When Offline Revision Cleanup is cancelled by the CancelCompactionSupplier, the corresponding return value is not correctly passed up the call chain, resulting in an incomplete compacted head state being set as the compacted head state (instead of being discarded).

The cancellation is silently triggered when the available disk space is less than 25% of the actual size of the repository.
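As a worked example of the 25% rule, the figures from the warning message quoted earlier (14.6 GB available, 59.4 GB repository) can be checked directly. The sketch below is illustrative only: the helper name is hypothetical, and it assumes both values in the message are expressed in GB, as in the quoted log line.

```python
import re

# Warning text as quoted in this article.
WARNING = ("Available disk space (14.6 GB) is too low, "
           "current repository size is approx. 59.4 GB")


def parse_gb_values(message):
    """Extract (available_gb, repository_gb) from the warning text.

    Assumes exactly two "<number> GB" values, in that order.
    """
    available, repository = re.findall(r"(\d+(?:\.\d+)?) GB", message)
    return float(available), float(repository)


available_gb, repository_gb = parse_gb_values(WARNING)
threshold_gb = 0.25 * repository_gb  # 25% of the repository size

print(f"threshold: {threshold_gb:.2f} GB, available: {available_gb} GB")
print("cancellation triggered:", available_gb < threshold_gb)
```

With these numbers the threshold is 14.85 GB, so 14.6 GB of free space falls just below it and the cancellation is triggered.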

Resolution

As mitigation, make sure there is enough disk space available at all times when running Offline Revision Cleanup with a version of Oak that is affected by this issue.

Before starting the compaction, there must be enough free space to store double the current repository size. While compaction is running, the remaining disk space must never drop below 25% of the current size of the repository.
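A pre-flight check along these lines could be scripted before launching oak-run. The following is a minimal sketch, not part of oak-run: the function names are hypothetical, the repository size is approximated by summing file sizes under the segmentstore directory, and free space is read for the file system that holds that directory.

```python
import os
import shutil


def directory_size(path):
    """Sum the sizes in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def enough_space_to_start(free_bytes, repo_bytes):
    """True if free space can hold double the current repository size."""
    return free_bytes >= 2 * repo_bytes


def preflight(segmentstore_path):
    """Return (ok, free_bytes, repo_bytes) for the given segmentstore."""
    repo_bytes = directory_size(segmentstore_path)
    free_bytes = shutil.disk_usage(segmentstore_path).free
    return enough_space_to_start(free_bytes, repo_bytes), free_bytes, repo_bytes


if __name__ == "__main__":
    # Example path only; adjust to the actual installation directory.
    path = "crx-quickstart/repository/segmentstore"
    if os.path.isdir(path):
        ok, free_b, repo_b = preflight(path)
        print(f"repository: {repo_b} bytes, free: {free_b} bytes, safe to start: {ok}")
```

This only covers the check before starting; it does not monitor the 25% limit while compaction is running.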

In the official documentation, you can find these guidelines:


What are the minimum requirements for disk space and heap memory when running Online Revision Cleanup?

Disk space is continuously monitored during Online Revision Cleanup. Should the available disk space drop below a critical value, the process will be cancelled. The critical value is 25% of the current disk footprint of the repository and it is not configurable.

It is recommended to size the disk at least two or three times larger than the repository size including the estimated growth.


OAK-7050 is fixed in oak-run version 1.6.8 which is targeted for release on January 8th, 2018.  This version of oak-run should be used on AEM 6.3 for running Offline Revision Cleanup.

If you need to run Offline Revision Cleanup before Oak 1.6.8 is released and cannot guarantee free disk space of at least twice the size of the repository, an early build can be downloaded from here.

There is no need to update anything on the AEM side; only the external oak-run tool is affected.

Note:

As a strong recommendation, always use the version of oak-run that matches the version of Oak in the respective AEM instance.

To deal with this issue, we recommend the following:

  1. Upgrade all instances to Oak 1.6.8 once the corresponding hotfix (HF) is released, and from then on keep using the matching version of oak-run.
  2. Until upgrading to Oak 1.6.8 is completed, use oak-run 1.6.8 (or oak-run-1.6.7-R1817912.jar from the above link until 1.6.8 is released) for offline compaction and use the matching version of oak-run for all other commands.
