Symptoms
Parsing very large, broken, or intentionally malicious input documents during the indexing process can cause excessive memory or CPU use or, in the worst cases, JVM crashes.
Cause
By default, the Tika parsers that perform full text extraction during indexing run in the same Java process as CQ, so a problematic document can lead to the symptoms described above.
Resolution
To better protect against such cases and to make Tika's resource consumption easier to manage, Tika parsers can now run in separate JVM processes (see https://issues.apache.org/jira/browse/TIKA-416).
This feature allows full text indexing of binary documents to be performed in separate JVM processes so that problems caused by parsing large or malformed documents will not affect the main CRX or CQ process. This increases overall reliability and stability of the system, as indexing problems are better isolated.
The feature works by automatically starting a pool of background processes dedicated to full text extraction.
These processes are started using the Java command specified as the "forkJavaCommand" parameter of the <SearchIndex/> section in the relevant workspace or repository configuration files (usually crx-quickstart/repository/workspaces/crx.default/workspace.xml).
To enable this feature, set this parameter to a Java command such as "java -Xmx32m".
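As a minimal sketch, a SearchIndex section that enables forked extraction without any priority adjustment might look like the following. The class and path values shown here are the ones typically found in a CRX workspace.xml and may differ in your installation; keep your existing parameters and add only the forkJavaCommand line.

```xml
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Run Tika text extraction in forked JVMs, each limited to a 32 MB maximum heap -->
  <param name="forkJavaCommand" value="java -Xmx32m"/>
</SearchIndex>
```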
Additional tip:
Even though the forked text extraction processes can't directly harm the parent process, they can consume a lot of CPU time, especially when processing many large PDF documents. To limit the effect on overall system performance, consider prefixing the forkJavaCommand value with "nice" or another platform-specific command that reduces the execution priority of the forked extraction processes. The recommended values for Linux and Windows are:
Linux: "nice java -Xmx32m"
Windows: "cmd /c start /low /wait /b java -Xmx32m"
Example configuration for Windows:
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>
<param name="resultFetchSize" value="100"/>
<param name="cacheSize" value="100000" />
<param name="forkJavaCommand" value="cmd /c start /low /wait /b java -Xmx32m"/>
</SearchIndex>
Example configuration for Linux:
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>
<param name="resultFetchSize" value="100"/>
<param name="cacheSize" value="100000" />
<param name="forkJavaCommand" value="nice java -Xmx32m"/>
</SearchIndex>
Applies to
CRX 2.2, CRX 2.1 with hotfixpack >= 2.1.0.9 installed

