How to Share a Single CRX Datastore To Preserve Disk Space

Question, Problem

I would like to preserve disk space. How can I consolidate and share the storage of CRX data files?

Note: The same method can be applied to instances of CQ5 WCM version 5.2.1.

Answer, Resolution

One way to preserve disk space in your environment is to share the CRX datastore directory over a network share between multiple installations of CRX.

WARNING: This process requires you to move your datastore directory to a shared network drive. If you are moving your datastore directory from a local folder to a shared network directory then you will experience a considerable loss in performance. Please consider this before implementing this process and weigh the benefits accordingly.

First of all, what is the datastore?

The data store is used by CRX to store large binary values. Normally, all node and property data is stored in a persistence manager, but large binaries benefit from special treatment, which can improve performance and reduce disk usage.
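To illustrate why sharing a data store can save space, the sketch below assumes that each binary is identified by a hash of its content, which is how the Jackrabbit FileDataStore documentation describes it (SHA-1). Under that assumption, the same binary stored by two instances maps to the same identifier and only needs to exist once. The class and method names are hypothetical and are not part of CRX or Jackrabbit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not part of the CRX or Jackrabbit API.
final class ContentId {
    // Returns the hex-encoded SHA-1 digest of the given bytes.
    static String of(byte[] content) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] onA = "same binary".getBytes(StandardCharsets.UTF_8);
        byte[] onB = "same binary".getBytes(StandardCharsets.UTF_8);
        // Identical content yields identical identifiers on both instances.
        System.out.println(ContentId.of(onA).equals(ContentId.of(onB)));
    }
}
```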

How to combine the datastore between multiple CRX instances

Consider a scenario where you have two CRX instances, A and B (it doesn't matter whether either one is an author or a publish instance). A is installed under /opt/day/crxA and B is installed under /opt/day/crxB. In a default unclustered installation of CRX, the datastore is stored under <path to instance>/crx-quickstart/repository/shared/repository/datastore.

Note that your "shared" directory path may be different if you have configured a cluster "shared" directory. The datastore is stored under <shared path>/repository/datastore.

Instructions: Copy the datastore files of instance A and instance B into one combined directory. For example, if A and B are on two different physical servers and use a common network share mounted at /mnt/nfsshare1:

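on server A
%> mkdir -p /mnt/nfsshare1/combined-datastore
%> cp -R /opt/day/crxA/crx-quickstart/repository/shared/repository/datastore/. /mnt/nfsshare1/combined-datastore/

on server B
%> cp -R /opt/day/crxB/crx-quickstart/repository/shared/repository/datastore/. /mnt/nfsshare1/combined-datastore/

The trailing /. copies the contents of each datastore directory into combined-datastore. Without it, the second copy would land in a nested combined-datastore/datastore directory because the target already exists.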
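The same consolidation can be sketched in Java for cases where a scripted merge is preferred. This is a minimal sketch, not a CRX tool: the class name DataStoreMerge is hypothetical, and skipping files that already exist in the target assumes that two datastore files with the same relative path hold the same content.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper; not part of CRX.
final class DataStoreMerge {
    // Copies every file under source into target, keeping relative paths.
    // Files that already exist in target are left untouched.
    // Returns the number of files actually copied.
    static int merge(Path source, Path target) throws IOException {
        int copied = 0;
        try (Stream<Path> paths = Files.walk(source)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path dest = target.resolve(source.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(dest);
                } else if (!Files.exists(dest)) {
                    Files.createDirectories(dest.getParent());
                    Files.copy(p, dest);
                    copied++;
                }
            }
        }
        return copied;
    }
}
```

Calling DataStoreMerge.merge once per instance, with the instance's datastore directory as source and /mnt/nfsshare1/combined-datastore as target, would mirror the two cp commands above.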

Configure repository.xml to point to the new datastore path for both instance A and B. Open repository.xml on instance A (/opt/day/crxA/crx-quickstart/server/runtime/0/crx/WEB-INF/repository.xml), change the datastore path as shown here, and then make the same change on instance B:

<DataStore class="org.apache.jackrabbit.core.data.FileDataStore"> 
<param name="path" value="/mnt/nfsshare1/combined-datastore"/> 
<param name="minRecordLength" value="4096"/> 
</DataStore> 
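Since both instances must point at exactly the same directory, it may help to check the configured value programmatically. The sketch below reads the "path" param of the DataStore element from an XML string using the standard JAXP parser. The class name DataStoreConfig is hypothetical, and the sketch assumes the XML is passed in as a plain string; a real repository.xml may declare a DOCTYPE, which this example does not handle specially.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of CRX.
final class DataStoreConfig {
    // Returns the value of the "path" param of the first DataStore element,
    // or null if no such element or param is present.
    static String dataStorePath(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList stores = doc.getElementsByTagName("DataStore");
        if (stores.getLength() == 0) {
            return null;
        }
        NodeList params = ((Element) stores.item(0)).getElementsByTagName("param");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);
            if ("path".equals(param.getAttribute("name"))) {
                return param.getAttribute("value");
            }
        }
        return null;
    }
}
```

Comparing the value returned for instance A's file with the one returned for instance B's file is a quick way to confirm that the two configurations agree.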

How to run datastore garbage collection when the datastore is shared by multiple instances of CRX

WARNING: This only applies to CRX 1.4.2 or patched versions of 1.4.1, because datastore garbage collection only works properly in 1.4.1 after applying a CRX hotfix (contact Day support for more information). Please test this in a development environment before implementing it in production.

When multiple CRX instances use the same datastore, first call gc.scan() on instance A, then on B, and so on. At the end, call gc.deleteUnused() on instance A.

Here is the initialization code:

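import org.apache.jackrabbit.core.SessionImpl;
import org.apache.jackrabbit.core.data.GarbageCollector;
//...
// "session" is an open JCR session on the local repository
GarbageCollector gc;
SessionImpl si = (SessionImpl)session;
gc = si.createDataStoreGarbageCollector();

// optional (if you want to implement a progress bar / output):
// "this" must implement ScanEventListener
gc.setScanEventListener(this);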

The following pseudocode shows the order in which the garbage collection calls should be run on the separate instances (this example assumes a three-instance cluster: A, B, and C):

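// 1. Scan every instance so that all records still in use are marked
gcA.scan();
gcB.scan();
gcC.scan();
// 2. Delete unused records once, from instance A
gcA.stopScan();
gcA.deleteUnused();
// 3. Stop the scan on the remaining instances
gcB.stopScan();
gcC.stopScan();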
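The ordering above can be generalized to any number of instances. The following is a minimal sketch under stated assumptions: the Collector interface is a hypothetical stand-in that only models the three calls used in this article, so the example runs without the Jackrabbit libraries. In a real setup, each entry would wrap the GarbageCollector obtained from the corresponding instance.

```java
import java.util.List;

// Hypothetical sketch; Collector is NOT the Jackrabbit GarbageCollector class.
final class SharedGc {
    // Models only the three calls used in this article.
    interface Collector {
        void scan() throws Exception;
        void stopScan() throws Exception;
        void deleteUnused() throws Exception;
    }

    // Scans every instance first, then deletes unused records once,
    // from the first instance, and finally stops the remaining scans.
    static void run(List<? extends Collector> collectors) throws Exception {
        if (collectors.isEmpty()) {
            return;
        }
        for (Collector gc : collectors) {
            gc.scan();
        }
        Collector first = collectors.get(0);
        first.stopScan();
        first.deleteUnused();
        for (Collector gc : collectors.subList(1, collectors.size())) {
            gc.stopScan();
        }
    }
}
```

The important property is that deleteUnused() is never called before every instance has completed its scan; otherwise records that are still referenced by an unscanned instance could be removed.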

An alternative method is:

  1. Write down the current time = X
  2. Run gc.scan() on each repository
  3. Manually delete files with last modified date older than X
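Step 3 can be prepared with a small script. The sketch below only lists the candidate files and does not delete anything, so the list can be reviewed first. It assumes, as the steps above imply, that gc.scan() refreshes the last-modified date of every record still in use, so anything older than X is unused. The class name StaleRecords is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; not part of CRX.
final class StaleRecords {
    // Lists regular files under root whose last-modified time
    // (milliseconds since the epoch) is earlier than cutoffMillis.
    static List<Path> olderThan(Path root, long cutoffMillis) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toFile().lastModified() < cutoffMillis)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Here cutoffMillis corresponds to the time X recorded in step 1, for example the value of System.currentTimeMillis() taken before the first scan starts.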

For more information, see http://wiki.apache.org/jackrabbit/DataStore#Running_Data_Store_Garbage_Collection_.28Jackrabbit_1.x.29

Affected Versions

CRX 1.4.2.X

[1] http://wiki.apache.org/jackrabbit/DataStore
