Apache_Tika_End-to-End

Apache Tika
End-to-End
An introduction to Apache Tika,
and integrating it into your application
Nick Burch
Software Engineer
Alfresco
Apache Tika
• http://tika.apache.org/
• Project which started in 2006
• Grew out of the Lucene community,
now widely used
• Provides detection of files – e.g. this
binary blob is really a Word file, that
one is UTF-8 plain text
• Plain text, HTML and XHTML versions
of a wide range of different file formats
• Consistent Metadata from different files
• Tika hides the complexity of the
different formats and their libraries
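To make the detection idea concrete, here is a minimal sketch of magic-byte sniffing in plain Java. It is not Tika's detector: the class name `MagicSniffer` and the returned labels are hypothetical, and only two well-known signatures (OLE2 and ZIP) plus a UTF-8 validity check are covered. Note that an OLE2 header only tells you the container type, which is why telling Word from Excel needs a look inside the container.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Tika)
class MagicSniffer {
    // OLE2 compound document header, used by the binary Office formats
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };
    // ZIP local file header, used by OOXML, ODF, EPUB and plain zips
    private static final byte[] ZIP = { 'P', 'K', 3, 4 };

    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // True if the bytes decode as UTF-8 without any malformed sequence
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Labels are made up for this sketch, apart from the two real MIME types
    static String detect(byte[] data) {
        if (startsWith(data, OLE2)) return "OLE2 container";
        if (startsWith(data, ZIP)) return "ZIP container";
        if (isValidUtf8(data)) return "text/plain; charset=UTF-8";
        return "application/octet-stream";
    }
}
```

Calling `MagicSniffer.detect(bytes)` on the first few kilobytes of a file is enough for this kind of check; a real detector would also consult file names and look inside containers.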
What's new?
• Lots of new parsers – text, office
formats, publishing formats, images,
audio, CAD, fonts etc
• Long standing parsers improved –
better HTML from Word, for example
• Embedded resources and containers
• Use expanding – used by many Solr
users, Alfresco, lots of people
crunching masses of data on Hadoop
Supported Formats Page 1
• Audio – WAV, RIFF, MIDI
• DWG (CAD)
• Epub
• RSS and ATOM Feeds
• True Type Fonts
• HTML
• Images – JPEG, GIF, PNG, TIFF,
Bitmap (including EXIF where found)
• iWork (Keynote, Pages etc)
• RFC822 mbox Mail
Supported Formats Page 2
• Microsoft Outlook .msg Email
• Microsoft Office (Binary) – Word,
PowerPoint, Excel, Visio, Publisher,
Works
• Microsoft Office (OOXML) – Word,
PowerPoint, Excel
• MP3 (ID3 v1 and v2)
• CDF (Scientific Data)
• Open Document Format (Open Office)
• Old-style Open Office (.sxw etc)
Supported Formats Page 3
• Zip and Tar archives
• RDF
• Plain Text
• FLV Video
• XML
• Java class files
And I probably forgot one...!
Metadata
• Tika provides consistent metadata
across the range of parsers
• No need to know if it's “Last Author”,
“Last Editor” or “Previous Author” in a
file format, they all come back with the
same metadata key
• Keys and values are strings, but
strongly typed metadata entries
provide converters to dates, ints etc
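The idea of one key for many format-specific names, with typed accessors on top of string values, can be sketched in plain Java as follows. This is not Tika's `Metadata` class: `CommonMetadata`, the alias table and the key name `last-author` are all hypothetical and only illustrate the principle.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata store, for illustration only
class CommonMetadata {
    // Made-up mapping from format-specific names to one shared key
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("Last Author", "last-author");
        ALIASES.put("Last Editor", "last-author");
        ALIASES.put("Previous Author", "last-author");
    }

    // Keys and values are both stored as strings
    private final Map<String, String> values = new HashMap<>();

    // Store a value under the shared key if the raw key has an alias
    void set(String rawKey, String value) {
        values.put(ALIASES.getOrDefault(rawKey, rawKey), value);
    }

    String get(String key) {
        return values.get(key);
    }

    // Typed converters layered over the string values
    int getInt(String key) {
        return Integer.parseInt(values.get(key));
    }

    Instant getDate(String key) {
        return Instant.parse(values.get(key)); // expects ISO-8601 text
    }
}
```

A parser for any format can call `set("Last Editor", ...)` or `set("Previous Author", ...)`, and the caller always reads `get("last-author")`.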
Text Content
• Tika generates HTML-like SAX events
as it parses
• Uses Java SAX API
• Events can be captured or transformed
• Body Content Handler used for plain
text
• HTML and XHTML available
• Can customise with your own handler,
with XSLT or with E4X from JavaScript
• eg HTML Table → CSV
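As a rough illustration of the "HTML Table → CSV" idea, the sketch below is a custom SAX handler that turns table events into CSV rows. It uses only the JDK: the class name `TableToCsvHandler` is hypothetical, and the events come from parsing an XHTML string with the JDK's SAX parser rather than from a Tika parser, but a handler of this shape could equally receive Tika's events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only
class TableToCsvHandler extends DefaultHandler {
    private final StringBuilder csv = new StringBuilder();
    private final StringBuilder cell = new StringBuilder();
    private boolean inCell = false;
    private boolean firstCellInRow = true;

    // Works for both namespace-aware and plain SAX sources
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        if (n.equals("tr")) {
            firstCellInRow = true;
        } else if (n.equals("td") || n.equals("th")) {
            inCell = true;
            cell.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCell) cell.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (n.equals("td") || n.equals("th")) {
            if (!firstCellInRow) csv.append(',');
            csv.append(quote(cell.toString().trim()));
            firstCellInRow = false;
            inCell = false;
        } else if (n.equals("tr")) {
            csv.append('\n');
        }
    }

    // Quote a field only when it contains a comma, quote or newline
    private static String quote(String s) {
        if (s.contains(",") || s.contains("\"") || s.contains("\n")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    @Override
    public String toString() {
        return csv.toString();
    }

    // Convenience entry point: parse an XHTML snippet and return the CSV
    static String convert(String xhtml) {
        try {
            TableToCsvHandler handler = new TableToCsvHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Text outside table cells is simply ignored, which is the kind of filtering that is easy to do once the content arrives as events.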
Calling Tika
// Get a content detector, and an autoselecting Parser
TikaConfig config = TikaConfig.getDefaultConfig();
ContainerAwareDetector detector =
    new ContainerAwareDetector(config.getMimeRepository());
Parser parser = new AutoDetectParser(detector);

// Plain text only content handler
ContentHandler handler = new BodyContentHandler();
// Assumed step, not shown on the slide: run the parse so the handler
// fills up. "stream" stands for an InputStream you have already opened.
parser.parse(stream, handler, new Metadata(), new ParseContext());
String text = handler.toString();

// XHTML content handler (an alternative to the plain text one above)
// newInstance() returns a TransformerFactory, so a cast is needed
SAXTransformerFactory factory =
    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
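The XHTML handler needs somewhere to write its output before it is useful. The following is a minimal JDK-only sketch of that wiring: the `TransformerHandler` is given a `StreamResult`, and a few SAX events are fed in by hand where a Tika parser would normally produce them. The class name `XhtmlSerializerDemo` and the choice to omit the XML declaration are assumptions made for this example.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class, for illustration only
class XhtmlSerializerDemo {
    // Serialises a single <p> element containing the given text
    static String serialize(String paragraphText) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Tell the handler where the serialised markup should go
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // Hand-fed events; a parser would normally emit these
            AttributesImpl noAtts = new AttributesImpl();
            char[] chars = paragraphText.toCharArray();
            handler.startDocument();
            handler.startElement("", "p", "p", noAtts);
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In real use the same `handler` would be passed to the parser instead of being driven by hand, and the writer would then hold the XHTML version of the document.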
Tika Parsers
Parser Interface
• Two key methods – what mime types
are supported, and do the parsing
public interface Parser {
    Set<MediaType> getSupportedTypes(ParseContext context);

    void parse(InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
}
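public class HelloWorldParser implements Parser {
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.parse("hello/world"));
        return types;
    }

    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // ... (the method body is cut off on the slide)
    }
}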
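Since the body of the parse method is not shown, here is a guess at the general shape such a method could take, written against the JDK's SAX classes only. It is a sketch under assumptions, not the presenter's code: `HelloWorldSketch` and `TextCollector` are hypothetical names, the method drops Tika's `Metadata` and `ParseContext` arguments, and it ignores the input stream entirely, as a "hello world" parser reasonably might.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, for illustration only (does not implement Tika's Parser)
class HelloWorldSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emits HTML-like SAX events for a fixed greeting; the stream is unused
    static void parse(InputStream stream, ContentHandler handler) throws SAXException {
        AttributesImpl noAtts = new AttributesImpl();
        char[] text = "Hello World".toCharArray();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", noAtts);
        handler.startElement(XHTML, "body", "body", noAtts);
        handler.startElement(XHTML, "p", "p", noAtts);
        handler.characters(text, 0, text.length);
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    // Collects character events, roughly what a plain text handler does
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Runs the sketch parser against an empty stream and returns the text
    static String run() {
        try {
            TextCollector collector = new TextCollector();
            parse(new ByteArrayInputStream(new byte[0]), collector);
            return collector.text.toString();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The point of the shape is that the parser never builds a document itself; it only reports events, and whichever handler the caller supplied decides what to keep.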
Demo: Tika-App
Demo: Geo-Tagged Images
in Alfresco Share via Tika
Any Questions?