Class used to extract phone numbers while parsing.
Every time a document is parsed in Tika, the content is split into SAX events.
Those SAX events are handled by a ContentHandler. You can think of these events
as marking a tag in an HTML file. Once you're finished parsing, you can call
handler.toString(), for example, to get the text contents of the file. The file's
metadata, by contrast, is added to the Metadata object passed in during the
parse() call. So, the Parser class sends metadata to the Metadata object and
content to the ContentHandler.
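The event flow described above can be sketched with the JDK's own SAX classes, without Tika on the classpath. This is a minimal illustration, not Tika code: TextCollectingExample, TextCollector, and extractText are hypothetical names, and the JDK's XML parser stands in for a Tika Parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollectingExample {

    /** Hypothetical handler: collects the text of every characters() event. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only the slice [start, start + length) belongs to this event.
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Parses the XML string and returns the text gathered by the handler. */
    static String extractText(String xml) {
        TextCollector handler = new TextCollector();
        try {
            // Assumption: the JDK SAX parser plays the role of a Tika Parser here.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Call 555-123-4567</p></body></html>"));
    }
}
```

The handler never sees the markup itself, only the events the parser emits, which is why toString() yields plain text.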
This class is an example of how to combine a ContentHandler and a Metadata.
As content is passed to the handler, we first check to see if it matches a
textual pattern for a phone number. If the extracted content is a phone number,
we add it to the metadata under the key "phonenumbers". So, if you used this
ContentHandler when you parsed a document, then called
metadata.getValues("phonenumbers"), you would get an array of Strings of phone
numbers found in the document.
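The decorator idea can be sketched using only JDK classes. Everything here is an assumption made for illustration: a Map stands in for Tika's Metadata object, XMLFilterImpl stands in for the decorator base class, and the pattern is a simplified NNN-NNN-NNNN matcher rather than the one this class actually uses. Only the "phonenumbers" key comes from the description above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class PhoneDecoratorSketch {

    // Simplified stand-in pattern (NNN-NNN-NNNN); not the matcher Tika uses.
    private static final Pattern PHONE =
            Pattern.compile("(?<!\\d)\\d{3}-\\d{3}-\\d{4}(?!\\d)");

    /** Hypothetical decorator: forwards events and records phone numbers. */
    static class PhoneDecorator extends XMLFilterImpl {
        private final Map<String, List<String>> metadata; // stand-in for Metadata
        private final StringBuilder seen = new StringBuilder();

        PhoneDecorator(ContentHandler wrapped, Map<String, List<String>> metadata) {
            setContentHandler(wrapped); // XMLFilterImpl forwards events to this handler
            this.metadata = metadata;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            seen.append(ch, start, length);      // keep a copy for matching
            super.characters(ch, start, length); // pass the text through unchanged
        }

        @Override
        public void endDocument() throws SAXException {
            Matcher m = PHONE.matcher(seen);
            while (m.find()) {
                metadata.computeIfAbsent("phonenumbers", k -> new ArrayList<>())
                        .add(m.group());
            }
            super.endDocument();
        }
    }

    /** Parses the XML string and returns the metadata filled in by the decorator. */
    static Map<String, List<String>> parse(String xml) {
        Map<String, List<String>> metadata = new HashMap<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new PhoneDecorator(new DefaultHandler(), metadata));
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return metadata;
    }

    public static void main(String[] args) {
        String xml = "<doc><p>Office: 555-123-4567</p><p>Fax: 555-987-6543</p></doc>";
        System.out.println(parse(xml).get("phonenumbers"));
    }
}
```

The wrapped handler still receives every event, so the decorator adds metadata without changing the content output.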
Please see the PhoneExtractingContentHandlerTest for an example of how to use
this class.
public PhoneExtractingContentHandler(org.xml.sax.ContentHandler handler,
Creates a decorator for the given SAX event handler and Metadata object.
handler - SAX event handler to be decorated
public void characters(char[] ch,
The characters method is called whenever a Parser wants to pass raw
characters to the ContentHandler. Depending on the specific Parser used,
however, a phone number may be split across several calls to characters.
So, we simply append all characters to a StringBuilder and analyze it
once the document is finished.
characters in interface org.xml.sax.ContentHandler
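The reason for buffering can be shown with a small comparison. This is a sketch under stated assumptions: SplitCharactersDemo and its handlers are hypothetical, the pattern is a simplified stand-in for the real matcher, and the chunks are fed by hand to mimic a parser that splits one number across two characters calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.helpers.DefaultHandler;

public class SplitCharactersDemo {

    // Simplified stand-in pattern (NNN-NNN-NNNN); an assumption for illustration.
    private static final Pattern PHONE =
            Pattern.compile("(?<!\\d)\\d{3}-\\d{3}-\\d{4}(?!\\d)");

    static List<String> findPhones(CharSequence text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** Matches each characters() call on its own, so split numbers are missed. */
    static class PerCallHandler extends DefaultHandler {
        final List<String> phones = new ArrayList<>();

        @Override
        public void characters(char[] ch, int start, int length) {
            phones.addAll(findPhones(new String(ch, start, length)));
        }
    }

    /** Buffers all text and matches once, in endDocument(). */
    static class BufferingHandler extends DefaultHandler {
        final List<String> phones = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length);
        }

        @Override
        public void endDocument() {
            phones.addAll(findPhones(buffer));
        }
    }

    static List<String> matchPerCall(String... chunks) {
        PerCallHandler handler = new PerCallHandler();
        for (String chunk : chunks) {
            handler.characters(chunk.toCharArray(), 0, chunk.length());
        }
        return handler.phones;
    }

    static List<String> matchBuffered(String... chunks) {
        BufferingHandler handler = new BufferingHandler();
        for (String chunk : chunks) {
            handler.characters(chunk.toCharArray(), 0, chunk.length());
        }
        handler.endDocument(); // analysis happens only after all text has arrived
        return handler.phones;
    }

    public static void main(String[] args) {
        // The same number delivered in two characters() calls.
        String[] chunks = {"Call 555-12", "3-4567 today"};
        System.out.println("per call: " + matchPerCall(chunks));
        System.out.println("buffered: " + matchBuffered(chunks));
    }
}
```

Matching per call finds nothing because neither chunk contains a complete number, while the buffered handler sees the joined text and finds it. The cost is that the whole document text is held in memory until parsing ends.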