Issue

The Lucene index folder is several gigabytes in size.
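
To confirm how much disk space the index uses before and after applying a solution, a check along these lines may help. This is only a minimal sketch: index_size_kb is a hypothetical helper name, and the paths in the usage comment are the default index locations named in the rebuild steps of this article, so adjust them to your installation.

```shell
# index_size_kb DIR: print the disk usage of DIR in kilobytes.
# Hypothetical helper for illustration; not part of CQ or CRX.
index_size_kb() {
    du -sk "$1" | cut -f1
}

# Example usage (run from the directory that contains crx-quickstart):
#   index_size_kb crx-quickstart/repository/workspaces/crx.default/index
#   index_size_kb crx-quickstart/repository/repository/index
```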

Solutions

Solution 1: Remove audit event nodes from the index via indexing configuration.

  1. Find and prepare the indexing_configuration.xml file for modification.

    In CQ5.2.x-5.4 and CRX1.x-2.2, you can find the indexing configuration under this location:

    • In CQSE: crx-quickstart/server/runtime/0/WEB-INF/classes/indexing_configuration.xml
    • In a third-party application server: inside the CRX war file under WEB-INF/classes/indexing_configuration.xml

    In CQ5.5 / CRX2.3+, see this article for how to modify the indexing_configuration.xml.

  2. Add the following index-rule to the top of the indexing_configuration.xml file:

    <index-rule nodeType="cq:AuditEvent">
    </index-rule>

    Note:

    If you disable indexing of audit events, then the CQ audit report no longer works.

Solution 2: Deactivate the highlighting feature, which also helps to reduce the overall index size.

Set the supportHighlighting parameter to false in the SearchIndex element of crx-quickstart/repository/workspaces/crx.default/workspace.xml:

<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
    ...
    <param name="supportHighlighting" value="false" />
</SearchIndex>

Note:

If you disable highlighting support, then search result excerpts no longer work in CQ.

Solution 3: Update tika-config.xml to disable indexing of PDF and MS Office binaries.

CQ5.3-5.4 / CRX2.0-2.2

In CQ5.3, CQ5.4, and CRX2.0-2.2, do the following:

  1. Log in to your server, open a command prompt, and change directories to crx-quickstart/server/runtime/0/_crx/WEB-INF/lib
  2. Run this command to extract tika-config.xml from the jackrabbit-core jar (make sure that a Java JDK, which provides the jar command, is installed):
    jar -xvf jackrabbit-core*.jar org/apache/jackrabbit/core/query/lucene/tika-config.xml
  3. Edit the extracted file org/apache/jackrabbit/core/query/lucene/tika-config.xml as needed.  See the attached tika-config.xml for an example.
  4. To update the xml file in the jackrabbit-core jar, run this command:
    jar -uvf jackrabbit-core-*.jar org/apache/jackrabbit/core/query/lucene/tika-config.xml
  5. Restart CQ for the changes to take effect.

CQ5.5/CRX2.3

In CQ5.5, to update tika-config.xml, do the following:

  1. Go to the Felix Web Console at http://<host>:<port>/system/console and find the
    "Day CRX Sling - CRX Embedded Repository com.day.crx.sling.server" bundle.
  2. Copy the ID number of the bundle; this is the number on the left side.
  3. Log in to your server and open a command prompt.
  4. Change directories to the location where the bundle is stored (<id> is the id number from step 2):
    cd crx-quickstart/launchpad/felix/bundle<id>
  5. Change directories to where the embedded jars are persisted using this command (your versionX.Y folder may have a higher version than 0.0):
    cd version0.0/bundle.jar-embedded/
  6. Run this command to extract the tika-config.xml file from the jar file (your jackrabbit-core jar may have a higher version than 2.4.0):
    jar -xvf jackrabbit-core-2.4.0.jar org/apache/jackrabbit/core/query/lucene/tika-config.xml
  7. Edit the extracted file org/apache/jackrabbit/core/query/lucene/tika-config.xml as needed.  See the attached tika-config.xml for an example.
  8. To update the xml file in the jackrabbit-core jar, run this command:
    jar -uvf jackrabbit-core-2.4.0.jar org/apache/jackrabbit/core/query/lucene/tika-config.xml
  9. Restart CQ for the changes to take effect.
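
Because the bundle ID, the versionX.Y folder, and the jackrabbit-core version differ between installations, it can be convenient to search for the embedded jar instead of building the path by hand. The following is a sketch under the assumption that the directory layout matches the steps above; find_jackrabbit_core_jar is a hypothetical helper name, and the Felix Web Console remains the authoritative way to identify the bundle.

```shell
# find_jackrabbit_core_jar CQ_HOME: list embedded jackrabbit-core jars under
# CQ_HOME/crx-quickstart/launchpad/felix.
# Hypothetical helper for illustration; assumes the bundle<id>/versionX.Y/
# bundle.jar-embedded layout described in the steps above.
find_jackrabbit_core_jar() {
    find "$1/crx-quickstart/launchpad/felix" \
        -type f \
        -path '*bundle.jar-embedded*' \
        -name 'jackrabbit-core-*.jar'
}

# Example usage (CQ_HOME is the directory that contains crx-quickstart):
#   find_jackrabbit_core_jar /path/to/cq_home
```

If the search prints more than one jar, check the bundle ID in the Felix Web Console before modifying anything.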

WARNING: If you disable this feature, you can no longer find PDF or Office documents by searching CQ for terms contained in the contents of the files.

Instructions for CQ5.5/CRX2.3 with service pack 2.1 onwards:

In CQ5.5/CRX2.3, the Apache Tika configuration file resides within the jackrabbit-core jar, so installing any service pack overwrites this change. Service pack 2.1 and later provide an option to keep the configuration outside the jackrabbit-core jar.

  1. Save the tika-config.xml at <cq_home>/crx-quickstart/repository/workspaces/crx.default/tika-config.xml
  2. Modify the SearchIndex element in workspace.xml to include the tikaConfigPath parameter. See the example SearchIndex element at [1].
  3. Restart CQ for the changes to take effect.
[1] repository/workspaces/crx.default/workspace.xml
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
     <param name="path" value="${wsp.home}/index"/>
     <param name="resultFetchSize" value="50"/>
     <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
</SearchIndex>
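
Before restarting, it may be worth checking that the parameter actually made it into workspace.xml. The following is a rough sketch only: has_search_param is a hypothetical helper name, and it does a plain text match with grep rather than parsing the XML, so it would also match a param element inside a comment.

```shell
# has_search_param FILE NAME: succeed if FILE contains a <param> element
# whose name attribute equals NAME.
# Hypothetical helper for illustration; a text match, not an XML parser.
has_search_param() {
    grep -q "<param[[:space:]][^>]*name=\"$2\"" "$1"
}

# Example usage (run from the directory that contains crx-quickstart):
#   ws=crx-quickstart/repository/workspaces/crx.default/workspace.xml
#   has_search_param "$ws" tikaConfigPath && echo "tikaConfigPath is set"
```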

Rebuild the Search Index:

After making the changes, you will need to rebuild the search index.

  1. Stop CQ/CRX
  2. Back up and delete these directories on the server:
    crx-quickstart/repository/repository/index
    crx-quickstart/repository/workspaces/crx.default/index
  3. Start CQ/CRX (IMPORTANT: Re-indexing can take anywhere from 1 hour to 48 hours depending on the amount of content you have in your repository.  Make sure that you have coordinated with your users to have a proper outage window.)
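
Step 2 can be sketched as a single move, which backs up and removes the index directories in one operation. This is a minimal illustration rather than an official procedure: backup_indexes is a hypothetical helper name, the backup location is your choice, and CQ/CRX must already be stopped when it runs.

```shell
# backup_indexes CQ_HOME BACKUP_DIR: move both index directories from
# CQ_HOME/crx-quickstart into BACKUP_DIR so they are rebuilt on next start.
# Hypothetical helper for illustration; stop CQ/CRX before calling it and
# use a new, empty BACKUP_DIR for each run.
backup_indexes() {
    cq_home=$1
    backup_dir=$2

    mkdir -p "$backup_dir" || return 1
    for rel in repository/repository/index repository/workspaces/crx.default/index; do
        src="$cq_home/crx-quickstart/$rel"
        if [ -d "$src" ]; then
            # Flatten the relative path into one directory name for the backup.
            dest="$backup_dir/$(printf '%s' "$rel" | tr '/' '_')"
            mv "$src" "$dest" || return 1
            echo "moved $src -> $dest"
        else
            echo "skipped $src (not found)"
        fi
    done
}

# Example usage:
#   backup_indexes /path/to/cq_home /path/to/index-backup
```

Moving instead of copying and then deleting keeps the old index available in case you need to restore it, provided the backup location has enough free space.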

Download

To download tika-config.xml for CRX2.3 Click Here

Additional information

Apache Tika is an open source toolkit that detects and extracts metadata and structured content from various file types. It is used by the CRX Lucene search index for text extraction and by CQ DAM for metadata extraction. You can update the tika-config.xml file to add your own custom text extraction implementations and to disable text extraction on binary files such as PDFs and Microsoft Excel, Word, and PowerPoint documents. In the case of this article, we disable text extraction on certain file types to reduce the size of CQ's Lucene search index.
