How to exclude certain paths of content from being indexed

Question

How can I stop CRX from adding/indexing specific trees of my content in the search index?

 

Answer, Resolution

By default, CRX provides no way to exclude specific paths from being indexed. However, the following workaround achieves it:

  1. First, review the steps for making changes to your indexing_config.xml file [1].
  2. Add the following rules at the beginning of your indexing_config.xml file:
    <index-rule nodeType="nt:base" condition="@excludefromindex='true'" />
    <index-rule nodeType="nt:base" condition="ancestor::*/@excludefromindex='true'" />


    This excludes every node whose 'excludefromindex' property is set to 'true', along with all of its sub-nodes.
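
To make the combined effect of the two rules concrete, here is a minimal Java sketch that models which nodes they match. It only illustrates the matching semantics and is not how CRX or Jackrabbit evaluates index rules internally. The SketchNode and IndexExclusionSketch classes and the node names are hypothetical and exist only for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a repository node; NOT a JCR or Jackrabbit class.
class SketchNode {
    final String name;
    final SketchNode parent; // null for the root node
    final Map<String, String> properties = new HashMap<>();

    SketchNode(String name, SketchNode parent) {
        this.name = name;
        this.parent = parent;
    }
}

class IndexExclusionSketch {
    static final String FLAG = "excludefromindex";

    // Models the first rule: condition="@excludefromindex='true'"
    static boolean matchesSelfRule(SketchNode node) {
        return "true".equals(node.properties.get(FLAG));
    }

    // Models the second rule: condition="ancestor::*/@excludefromindex='true'"
    static boolean matchesAncestorRule(SketchNode node) {
        for (SketchNode a = node.parent; a != null; a = a.parent) {
            if ("true".equals(a.properties.get(FLAG))) {
                return true;
            }
        }
        return false;
    }

    // A node is left out of the index when either rule matches it.
    static boolean isExcluded(SketchNode node) {
        return matchesSelfRule(node) || matchesAncestorRule(node);
    }

    public static void main(String[] args) {
        SketchNode content = new SketchNode("content", null);
        SketchNode privateSite = new SketchNode("privatesite", content);
        SketchNode page = new SketchNode("page1", privateSite);
        SketchNode publicSite = new SketchNode("publicsite", content);

        // Flag only the top node of the tree that should be excluded.
        privateSite.properties.put(FLAG, "true");

        System.out.println("privatesite excluded: " + isExcluded(privateSite)); // true
        System.out.println("page1 excluded: " + isExcluded(page));              // true
        System.out.println("publicsite excluded: " + isExcluded(publicSite));   // false
        System.out.println("content excluded: " + isExcluded(content));         // false
    }
}
```

The first rule covers only the flagged node itself, while the second covers everything below it, which is why both rules are needed.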

  3. The next step is to add the 'excludefromindex' property to the top node of the paths you want to exclude from indexing.

    Since nodes under /content are cq:Page nodes and properties cannot be added to cq:Page nodes, we'll define a mixin node type that carries the 'excludefromindex'
    property. When you add the mixin type to a cq:Page node, the 'excludefromindex' property is created automatically.

    To create the mixin:
    a. Go to the /crx/index.jsp web app and log in as admin (if you are using CRX 2.3 or later, go to /crx/explorer/index.jsp instead)
    b. Click on "Node Type Administration"
    c. In the CRX Node Type Administration tool, create a mixin type that has a single property 'excludefromindex' of String type with default value "true".
    d. Set the AutoCreate flag of the property to True.
    e. Using Content Explorer, add the mixin type to the top level cq:Page nodes of the site you want to exclude from search.

  4. At this point, you are not done yet. Even though you have added the mixin type, the content still exists in the search index. To remove it from the search index, you will need to reindex the content tree.

    To do this, you have the following options:
  • Rebuild the Lucene search index:
    a. Stop CRX
    b. Back up and delete crx-quickstart/repository/workspaces/crx.default/index
    c. Start CRX (reindexing can take a very long time, 1-48 hours, depending on the size of your repository; plan accordingly).
  • Or use the attached 'touch_tree.jsp' to 'touch' the part of the content you'd like to reindex (this works only in CRX 2.2.x and older versions, not in CRX 2.3 or later):

    a. To run touch_tree.jsp, you must first add it to the CRX web application. Copy the file to crx-quickstart/server/runtime/0/_crx/config/.
    b. Go to http://localhost:4502/crx and log in as admin.
    c. Go to http://localhost:4502/crx/config/touch_tree.jsp
    d. Enter a path and run the touch process.

    This script reads every node and property in the tree and writes the same data back, which causes that content to be reindexed. Note that if you are using the default persistence configuration for CRX (Tar Persistence Manager), the tar files will grow considerably. During this process you may also see InvalidItemStateException errors if other writes occur in CRX while touch_tree.jsp is running.

WARNING: Do not exclude content from the index on a CQ author instance, because doing so breaks the reference search that runs when you move a page or asset. Over time, this could cause invalid links to show up in your site. The procedure is safe to use on a CQ publish instance.
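
For the first option in step 4 (back up and delete the index directory), here is a minimal Java sketch that combines the backup and the delete by moving the directory aside. It assumes CRX is already stopped and that the program runs from the directory containing crx-quickstart. The IndexBackupSketch class, the moveIndexAside helper and the "index.bak-" backup name are hypothetical choices for this example, not part of CRX.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; run it only while CRX is stopped.
class IndexBackupSketch {

    // Moves the index directory to backupDir, so the original location is
    // empty and CRX rebuilds the index on its next start.
    // Returns the backup location.
    static Path moveIndexAside(Path indexDir, Path backupDir) throws IOException {
        if (!Files.isDirectory(indexDir)) {
            throw new IOException("Index directory not found: " + indexDir);
        }
        if (Files.exists(backupDir)) {
            throw new IOException("Backup target already exists: " + backupDir);
        }
        // Keeping the backup next to the original makes this a simple rename.
        // Moving a non-empty directory to another file system would need a
        // recursive copy instead.
        return Files.move(indexDir, backupDir);
    }

    public static void main(String[] args) throws IOException {
        // Path taken from step 4; the backup name is an assumption.
        Path index = Paths.get("crx-quickstart/repository/workspaces/crx.default/index");
        Path backup = index.resolveSibling("index.bak-" + System.currentTimeMillis());
        System.out.println("Index moved to " + moveIndexAside(index, backup));
    }
}
```

Once the reindex has completed and search results look correct, the backup directory can be removed by hand.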

[1] http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/SearchIndexingConfig.html

 

Applies to

CRX 2.x

Download