Books and Readings:
All books, papers or reports will be available to students in one of two ways: 1) in the USC bookstore; and 2) via the web.

Required Reading:
Tika in Action (available online from Manning Publications)

Class Structure & Schedule:
Class sequence, dates, topics and guest speakers are subject to change as the semester proceeds. Any revisions will be noted and announced in class in advance.

| Week | Topics/Activities | Readings & Homework | Deliverables |
|------|-------------------|---------------------|--------------|
| Wk1 | Digital File Formats (Taxonomy of File Formats; Structured Text; Universal Metadata) and Big Data Introduction. Apache Tika Introduction. | Tika in Action Chapter 1 | |
| Wk2 | Content Detection Libraries and Installation. Installing Apache Tika. | Tika in Action Chapter 2 | |
| Wk3 | Scale and Growth of Content. Search Engines and their use of Content Detection and Analysis. Machine Learning and Content Detection. | Tika in Action Chapter 3 | Team Formation Discussion |
| Wk4 | Document Type Detection (Internet Media Types; Diagnosing File Formats; IANA Taxonomy of MIME types) | Tika in Action Chapter 4 | Project 1 Assigned |
| Wk5 | Content Extraction (Full Text, Streaming Parsing, Structured Output, Context Sensitive Parsing) | Tika in Action Chapter 5 | |
| Wk6 | Understanding Metadata (Metadata Standards; Metadata Quality; Practical uses) | Tika in Action Chapter 6 | |
| Wk7 | Language Detection and Translation | Tika in Action Chapter 7 | |
| Wk8 | File Formats and Representation (Scientific Data; Text; XML-based formats); File Headers and Naming Convention; Storage | Tika in Action Chapter 8 | Project 1 Due; Project 2 Assigned |
| Wk9 | Large Scale Content Detection (Apache Spark™, Hadoop, and Tika); Integrating Content Software | Tika in Action Chapter 9 | |
| Wk10 | Survey of Open Source Content Detection Technologies (Textract; Scrapy; Droids; Mahout) and other Parser libraries | Tika in Action Chapter 10 | |
| Wk11 | Advanced Media Type Detection Algorithms (Bayesian; Byte Histograms) | Tika in Action Chapter 11 | |
| Wk12 | Searching Scientific Datasets | Tika in Action Chapter 12 and 14 | |
| Wk13 | Content Management Systems | Tika in Action Chapter 13 | |
| Wk14 | Public Datasets for Content Extraction (Public Terabyte Dataset; Amazon AWS public data; DARPA Memex and DARPA XDATA) | Tika in Action Chapter 15 | |
| Wk15 | Named Entity Recognition; Begin Review for Final Exam | Discussion of Apache UIMA, cTAKES and OpenNLP; Stanford NLTK and other toolkits | |
| Wk16 | Summary and Review | | Project 2 Due; Project 3 Assigned |
| Final Exam | | | Project 3 Due |