Optimize the Lucene index to save disk space and improve efficiency

Issue

The Lucene index folder has grown to several gigabytes.
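
To confirm the issue before changing anything, you can measure the two index directories that this article later asks you to rebuild. The following is a minimal sketch, assuming a POSIX shell with du and cut available; the function name report_index_sizes is hypothetical.

```shell
# Print the size in kilobytes of each index directory found under $1,
# where $1 is the crx-quickstart directory. Missing directories are skipped.
report_index_sizes() {
  for rel in repository/repository/index repository/workspaces/crx.default/index; do
    if [ -d "$1/$rel" ]; then
      # du -sk prints "<kilobytes><TAB><path>"; keep only the number.
      printf '%s\t%s\n' "$(du -sk "$1/$rel" | cut -f1)" "$rel"
    fi
  done
}
```

Calling report_index_sizes with the path to your crx-quickstart directory prints one line per index directory, so you can compare the numbers before and after applying a solution.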

Solutions

Solution 1: Remove audit event nodes from the index via indexing configuration.

  1. Find and prepare the indexing_configuration.xml file for modification.

    In CQ5.2.x-5.4 and CRX1.x-2.2, you can find the indexing configuration under this location:

    • In CQSE: crx-quickstart/server/runtime/0/WEB-INF/classes/indexing_configuration.xml
    • In a third-party application server, it is contained within the CRX war file under WEB-INF/classes/indexing_configuration.xml

    In CQ5.5 / CRX2.3+, see this article for how to modify the indexing_configuration.xml.

  2. Add the following index-rule to the top of the indexing_configuration.xml file:

    <index-rule nodeType="cq:AuditEvent">
    </index-rule>
    Note:

    If you disable indexing of audit events, then the CQ audit report no longer works.
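
Step 2 of Solution 1 can be sketched as a small shell function. This is only an illustration, assuming a POSIX shell with awk and grep, an indexing_configuration.xml whose rules sit inside a <configuration> root element, and a cq namespace prefix that is already declared on that element. The function name add_audit_rule is hypothetical.

```shell
# Print the indexing configuration in file $1 with an empty cq:AuditEvent
# rule placed before the first existing <index-rule> (or before
# </configuration> when the file has no rules yet).
# The input file itself is not modified.
add_audit_rule() {
  # Do nothing if the rule is already present.
  if grep -q 'nodeType="cq:AuditEvent"' "$1"; then
    cat "$1"
    return 0
  fi
  awk '
    !done && (/<index-rule/ || /<\/configuration>/) {
      print "  <index-rule nodeType=\"cq:AuditEvent\">"
      print "  </index-rule>"
      done = 1
    }
    { print }
  ' "$1"
}
```

Because the function writes to standard output, you can review the result in a separate file before replacing the original, for example by redirecting the output to indexing_configuration.xml.new.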

Solution 2: Deactivate the highlighting feature to reduce the overall index size.

In crx-quickstart/repository/workspaces/crx.default/workspace.xml, set the supportHighlighting parameter to false in the SearchIndex element:

<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
    ...
    <param name="supportHighlighting" value="false" />
</SearchIndex>
Note:

If you disable highlighting support, then search result excerpts no longer work in CQ.
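
To check which value a workspace.xml file currently carries for a parameter such as supportHighlighting, a quick text lookup can help. The following is a rough sketch, assuming a POSIX shell with sed and a param element written on one line with name before value, as in the example above. It is not an XML parser, and the function name workspace_param is hypothetical.

```shell
# Print the value of the first <param name="$2" value="..."/> found in
# workspace file $1. Prints nothing when the parameter is absent.
workspace_param() {
  sed -n 's/.*<param[[:space:]]*name="'"$2"'"[[:space:]]*value="\([^"]*\)".*/\1/p' "$1" | head -n 1
}
```

For instance, workspace_param workspace.xml supportHighlighting should print false once the change from Solution 2 is in place.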

Solution 3: Update tika-config.xml to disable indexing of PDF and MS Office binaries.

CQ5.3-5.4 / CRX2.0-2.2

In CQ5.3, CQ5.4, and CRX2.0-2.2, do the following:

  1. Log in to your server, open a command prompt, and change directories to crx-quickstart/server/runtime/0/_crx/WEB-INF/lib
  2. Run this command to extract the tika-config.xml from the jackrabbit-core jar (make sure that you have the Java JDK installed, which provides the jar command):
    jar -xvf jackrabbit-core-*.jar org/apache/jackrabbit/core/query/lucene/tika-config.xml
  3. Edit the extracted file org/apache/jackrabbit/core/query/lucene/tika-config.xml as needed.  See the attached tika-config.xml for an example.
  4. To update the xml file in the jackrabbit-core jar, run this command:
    jar -uvf jackrabbit-core-*.jar org/apache/jackrabbit/core/query/lucene/tika-config.xml
  5. Restart CQ for the changes to take effect.
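
The wildcard in the jar commands above is only safe when exactly one jackrabbit-core jar matches, because any extra match would be read as a file name to extract or update. The following is a minimal sketch of the extract and update steps with that check added, assuming a POSIX shell and the JDK jar command on the PATH. The function names are hypothetical, and the .bak copy of the jar is an added precaution that is not part of the original steps.

```shell
TIKA_ENTRY=org/apache/jackrabbit/core/query/lucene/tika-config.xml

# Print the path of the single jackrabbit-core jar in directory $1.
# Fails when zero or several jars match.
find_core_jar() {
  set -- "$1"/jackrabbit-core-*.jar
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "expected exactly one jackrabbit-core jar" >&2
    return 1
  fi
  printf '%s\n' "$1"
}

# Copy the jar to <jar>.bak, then extract tika-config.xml into directory $1.
extract_tika_config() {
  jar_file=$(find_core_jar "$1") || return 1
  cp "$jar_file" "$jar_file.bak" || return 1
  (cd "$1" && jar -xvf "$(basename "$jar_file")" "$TIKA_ENTRY")
}

# Write the edited tika-config.xml under directory $1 back into the jar.
update_tika_config() {
  jar_file=$(find_core_jar "$1") || return 1
  (cd "$1" && jar -uvf "$(basename "$jar_file")" "$TIKA_ENTRY")
}
```

A typical run would call extract_tika_config on the WEB-INF/lib directory, edit the extracted file, and then call update_tika_config on the same directory before restarting CQ.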

CQ5.5/CRX2.3

In CQ5.5, to update tika-config.xml, do the following:

  1. Go to the Felix Web Console at http://<host>:<port>/system/console and find the "Day CRX Sling - CRX Embedded Repository com.day.crx.sling.server" bundle.
  2. Copy the ID number of the bundle, which is the number on the left side.
  3. Log in to your server and open a command prompt.
  4. Change directories to the location where the bundle is stored (<id> is the id number from step 2):
    cd crx-quickstart/launchpad/felix/bundle<id>
  5. Change directories to where the embedded jars are persisted using this command (your versionX.Y folder may have a higher version than 0.0):
    cd version0.0/bundle.jar-embedded/
  6. Run this command to extract the tika-config.xml file from the jar file (your jackrabbit-core jar may have a higher version than 2.4.0):
    jar -xvf jackrabbit-core-2.4.0.jar org/apache/jackrabbit/core/query/lucene/tika-config.xml
  7. Edit the extracted file org/apache/jackrabbit/core/query/lucene/tika-config.xml as needed.  See the attached tika-config.xml for an example.
  8. To update the xml file in the jackrabbit-core jar, run this command:
    jar -uvf jackrabbit-core-2.4.0.jar org/apache/jackrabbit/core/query/lucene/tika-config.xml
  9. Restart CQ for the changes to take effect.

WARNING: If you disable this feature, you can no longer find PDF or Office documents in CQ by searching for terms contained in the contents of those files.

Instructions for CQ5.5/CRX2.3 with service pack 2.1 onwards:

In CQ5.5/CRX2.3, the Apache Tika configuration file resides within the jackrabbit-core jar, so installing any service pack overwrites changes made to it. Service pack 2.1 and later provide a configuration option that lets you keep the file outside the jackrabbit-core jar.

  1. Save the tika-config.xml at <cq_home>/crx-quickstart/repository/workspaces/crx.default/tika-config.xml
  2. Modify the SearchIndex element to include the tikaConfigPath parameter. See [1] for an example SearchIndex element in workspace.xml.
  3. Restart CQ for the changes to take effect.
[1] repository/workspaces/crx.default/workspace.xml
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
     <param name="path" value="${wsp.home}/index"/>
     <param name="resultFetchSize" value="50"/>
     <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
</SearchIndex>

Rebuild the Search Index:

After making the changes, you will need to rebuild the search index.

  1. Stop CQ/CRX
  2. Back up and delete these directories on the server:
    crx-quickstart/repository/repository/index
    crx-quickstart/repository/workspaces/crx.default/index
  3. Start CQ/CRX (IMPORTANT: Re-indexing can take anywhere from 1 hour to 48 hours depending on the amount of content you have in your repository.  Make sure that you have coordinated with your users to have a proper outage window.)
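
Step 2 of the rebuild can be sketched as one move operation, which backs up and removes each index directory at the same time. This is an illustration only, assuming a POSIX shell, that CQ/CRX is already stopped, and that the backup location has enough free space. The function name backup_and_remove_indexes and the flattened backup directory names are hypothetical.

```shell
# Move both index directories found under $1 (the crx-quickstart directory)
# into backup directory $2. Run this only while CQ/CRX is stopped.
backup_and_remove_indexes() {
  cq_home=$1
  backup_dir=$2
  mkdir -p "$backup_dir" || return 1
  for rel in repository/repository/index repository/workspaces/crx.default/index; do
    src=$cq_home/$rel
    # Flatten the relative path so the two "index" directories do not collide.
    dest=$backup_dir/$(printf '%s' "$rel" | tr '/' '_')
    if [ ! -d "$src" ]; then
      echo "skipped $src (not found)" >&2
      continue
    fi
    if [ -e "$dest" ]; then
      echo "backup target $dest already exists" >&2
      return 1
    fi
    mv "$src" "$dest" || return 1
    echo "moved $src to $dest"
  done
}
```

After the function returns successfully, the index directories no longer exist in the repository, and CQ/CRX rebuilds them on the next start.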

Downloads

To download tika-config.xml for CRX2.3, click here.

Additional information

Apache Tika is an open source toolkit which detects and extracts metadata and structured content from various file types. It is used by the CRX Lucene search index for text extraction and by CQ DAM for metadata extraction. You can update the tika-config.xml file to add your own custom text extraction implementations and to disable text extraction on binary files such as PDFs and Microsoft Excel, Word, and PowerPoint documents. In the case of this article, we disable text extraction on certain file types to reduce the size of CQ's Lucene search index.
