Apache Tika End-to-End
An introduction to Apache Tika, and integrating it into your application

Nick Burch
Software Engineer
Alfresco

Apache Tika
• http://tika.apache.org/
• Project which started in 2006
• Grew out of the Lucene community, now widely used
• Provides detection of files – eg this binary blob is really a Word file, that one is UTF-8 plain text
• Plain text, HTML and XHTML versions of a wide range of different file formats
• Consistent Metadata from different files
• Tika hides the complexity of the different formats and their libraries,

What's new?
• Lots of new parsers – text, office formats, publishing formats, images, audio, CAD, fonts etc
• Long standing parsers improved – better HTML from Word for example
• Embedded resources and containers
• Use expanding – used by many Solr users, Alfresco, lots of people crunching masses of data on Hadoop

Supported Formats Page 1
• Audio – WAV, RIFF, MIDI
• DWG (CAD)
• Epub
• RSS and ATOM Feeds
• True Type Fonts
• HTML
• Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
• iWork (Keynote, Pages etc)
• RFC822 mbox Mail

Supported Formats Page 2
• Microsoft Outlook .msg Email
• Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
• Microsoft Office (OOXML) – Word, PowerPoint, Excel
• MP3 (id3 v1 and v2)
• CDF (Scientific Data)
• Open Document Format (Open Office)
• Old-style Open Office (.sxw etc)

Supported Formats Page 3
• Zip and Tar archives
• RDF
• Plain Text
• FLV Video
• XML
• Java class files
• And I probably forgot one...!
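To make the detection bullet above concrete ("this binary blob is really a Word file, that one is UTF-8 plain text"), here is a minimal sketch of magic-byte sniffing using only the JDK. It is not Tika's implementation: the class name MagicSniffer, its methods and the returned labels are hypothetical, and only three signatures are checked. Note that the magic bytes alone only identify an OLE2 or ZIP container, which is why telling Word from Excel needs a detector that looks inside the container.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical toy detector, for illustration only (not part of Tika).
class MagicSniffer {
    // Signature of the OLE2 container used by binary Word, Excel, PowerPoint etc.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Signature of a ZIP local file header (OOXML, ODF and Epub are all ZIP based).
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    // First four bytes of a PNG image.
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};

    // True if data begins with the given magic bytes.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns a rough, human-readable guess at what the bytes are.
    static String sniff(byte[] data) {
        if (data.length == 0) return "empty";
        if (startsWith(data, OLE2)) return "OLE2 container";
        if (startsWith(data, ZIP)) return "ZIP container";
        if (startsWith(data, PNG)) return "PNG image";
        if (isUtf8(data)) return "UTF-8 text";
        return "unknown binary";
    }

    public static void main(String[] args) {
        System.out.println(sniff("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

A real detector needs far more signatures, and for container formats it also needs to inspect the entries inside the container.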
Metadata
• Tika provides consistent metadata across the range of parsers
• No need to know if it's “Last Author”, “Last Editor” or “Previous Author” in a file format, they all come back with the same metadata key
• Keys and values are strings, but strongly typed metadata entries provide converters to dates, ints etc

Text Content
• Tika generates HTML-like SAX events as it parses
• Uses Java SAX API
• Events can be captured or transformed
• Body Content Handler used for plain text
• HTML and XHTML available
• Can customise with your own handler, with XSLT or with E4X from JavaScript
• eg HTML Table → CSV

Calling Tika

    // Get a content detector, and an autoselecting Parser
    TikaConfig config = TikaConfig.getDefaultConfig();
    ContainerAwareDetector detector = new ContainerAwareDetector(
        config.getMimeRepository()
    );
    Parser parser = new AutoDetectParser(detector);

    // We’ll only want the plain text contents
    ContentHandler handler = new BodyContentHandler();

    // Plain text only content handler
    ContentHandler handler = new BodyContentHandler();
    // Once parsing has finished, the handler holds the extracted text
    String text = handler.toString();

    // XHTML content handler
    // newInstance() is declared to return TransformerFactory, so a cast is needed
    SAXTransformerFactory factory =
        (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");

Tika Parsers

Parser Interface
• Two key methods – what mime types are supported, and do the parsing

    public interface Parser {
        Set<MediaType> getSupportedTypes(ParseContext context);

        void parse(InputStream stream, ContentHandler handler,
                   Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException;
    }

    public class HelloWorldParser implements Parser {
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            Set<MediaType> types = new HashSet<MediaType>();
            types.add(MediaType.parse("hello/world"));
            return types;
        }

        public void parse(InputStream stream, ContentHandler handler,
                          Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            // ...
        }
    }

Demo: Tika-App

Demo: Geo-Tagged Images in Alfresco Share via Tika

Any Questions?
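As a follow-up to the Text Content and content handler slides, here is a minimal sketch of how one stream of SAX events can be either captured as plain text or serialised as XHTML. It uses only the JDK. The class SaxEventDemo and its emitDocument method are hypothetical stand-ins for a Tika parser; in a real application the same handlers would be passed to the parser's parse method instead.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only (not part of Tika).
class SaxEventDemo {

    // Stand-in for a parser: fires HTML-like SAX events for one paragraph.
    static void emitDocument(ContentHandler handler, String text) throws SAXException {
        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        for (String tag : new String[] {"html", "body", "p"}) {
            handler.startElement("", tag, tag, noAttributes);
        }
        handler.characters(text.toCharArray(), 0, text.length());
        for (String tag : new String[] {"p", "body", "html"}) {
            handler.endElement("", tag, tag);
        }
        handler.endDocument();
    }

    // Captures only the character events, ignoring all markup.
    static String toPlainText(String text) {
        StringBuilder out = new StringBuilder();
        ContentHandler textOnly = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            emitDocument(textOnly, text);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    // Serialises the same events as XML markup via a TransformerHandler.
    static String toXhtml(String text) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            handler.setResult(new StreamResult(writer));
            emitDocument(handler, text);
            return writer.toString();
        } catch (SAXException | TransformerConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("Hello, Tika"));
        System.out.println(toXhtml("Hello, Tika"));
    }
}
```

The design point is that the event source never needs to know which output is wanted: swapping the handler is enough to move between plain text, XHTML, or a custom transformation such as HTML Table → CSV.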