Parsing large, broken, or malicious input can cause excessive memory or CPU use during indexing.

In the worst case, the JVM crashes.

Use Tika parsers in separate JVM processes.

Running the Tika parsers in separate JVM processes protects against such cases and makes Tika's resource consumption easier to manage. For implementation details, see https://issues.apache.org/jira/browse/TIKA-416

This feature allows full text indexing of binary documents in separate JVM processes. That way, problems caused by parsing large or malformed documents don't affect the main CRX or CQ process. This feature increases overall reliability and stability of the system, as indexing problems are better isolated.

The feature works by automatically starting a pool of background processes dedicated to full text extraction.

These processes are started with the Java command specified in the "forkJavaCommand" parameter of the <SearchIndex/> section in the relevant workspace or repository configuration file (usually crx-quickstart/repository/workspaces/crx.default/workspace.xml).

To enable the feature, set this parameter to a Java command, for example "java -Xmx512m".
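
As a minimal sketch, a <SearchIndex/> section with the feature enabled and no priority adjustment might look like the following. The class name and the other parameters are copied from the platform-specific examples in this article and may differ in your workspace.xml; the forkJavaCommand line is the only relevant change. The -Xmx512m option is the standard JVM flag that caps the maximum heap size of the process it starts.

        <SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
            <param name="path" value="${wsp.home}/index"/>
            <param name="resultFetchSize" value="100"/>
            <param name="cacheSize" value="100000" />
            <!-- Java command used to start the forked text extraction processes -->
            <param name="forkJavaCommand" value="java -Xmx512m"/>
        </SearchIndex>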

Additional tip

The forked text extraction processes can't directly harm the parent process. However, they can consume a lot of CPU time, especially when processing many large PDF documents. To limit the effect on overall system performance, prefix the forkJavaCommand value with "nice" or another platform-specific command that lowers the execution priority of the forked extraction processes. The recommended values for Linux and Windows are:

  Linux: "nice java -Xmx512m"
  Windows: "cmd /c start /low /wait /b java -Xmx512m"

Windows:

        <SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
            <param name="path" value="${wsp.home}/index"/>
            <param name="resultFetchSize" value="100"/>
            <param name="cacheSize" value="100000" />
            <param name="forkJavaCommand" value="cmd /c start /low /wait /b java -Xmx512m"/>
        </SearchIndex>

Linux/Unix:

        <SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
            <param name="path" value="${wsp.home}/index"/>
            <param name="resultFetchSize" value="100"/>
            <param name="cacheSize" value="100000" />
            <param name="forkJavaCommand" value="nice java -Xmx512m"/>
        </SearchIndex>

Applies to

CRX 2.2; CRX 2.1 with hotfix pack 2.1.0.9 or later installed

Note:

On Linux, the nice and java commands only work if they are on the PATH of the CQ user's shell session.

On Windows, the same applies to the java command: the PATH variable must include the bin folder of the correct Java version.

See here for details on how to set the PATH variable for Java.
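
If changing PATH is not practical, one possible alternative is to give absolute paths in forkJavaCommand. This is only a sketch: it assumes the parameter accepts absolute paths in the same way it accepts bare command names, which this article does not confirm, and the paths shown are hypothetical placeholders that must be replaced with the actual locations on your system.

        <!-- Assumption: absolute paths are accepted; both paths below are hypothetical -->
        <param name="forkJavaCommand" value="/usr/bin/nice /usr/lib/jvm/java/bin/java -Xmx512m"/>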

Additional information

By default, the Tika parsers that perform this text extraction run in the same Java process as CQ, so a single problematic document can cause the symptoms described above.

