Books and Readings:
All books, papers or reports will be available to students in one of two ways: 1) in the
USC bookstore; and 2) via the web.
Required Reading:
Tika in Action (available online from Manning Publications)
Class Structure & Schedule:
Class sequence, dates, topics, and guest speakers are subject to change as the semester
proceeds. Any revisions will be announced in class in advance.
Week | Topics/Activities | Readings & Homework | Deliverables
Wk1 | Digital File Formats (Taxonomy of File Formats; Structured Text; Universal Metadata) and Big Data Introduction. Apache Tika Introduction. | Tika in Action Chapter 1 |
Wk2 | Content Detection Libraries and Installation. Installing Apache Tika. | Tika in Action Chapter 2 |
Wk3 | Scale and Growth of Content. Search Engines and their use of Content Detection and Analysis. Machine Learning and Content Detection. | Tika in Action Chapter 3 | Team Formation Discussion
Wk4 | Document Type Detection (Internet Media Types; Diagnosing File Formats; IANA Taxonomy of MIME types) | Tika in Action Chapter 4 | Project 1 Assigned
Wk5 | Content Extraction (Full Text, Streaming Parsing, Structured Output, Context Sensitive Parsing) | Tika in Action Chapter 5 |
Wk6 | Understanding Metadata (Metadata Standards; Metadata Quality; Practical uses) | Tika in Action Chapter 6 |
Wk7 | Language Detection and Translation | Tika in Action Chapter 7 |
Wk8 | File Formats and Representation (Scientific Data; Text; XML-based formats); File Headers and Naming Convention; Storage | Tika in Action Chapter 8 | Project 1 Due; Project 2 Assigned
Wk9 | Large Scale Content Detection (Apache Spark™, Hadoop, and Tika); Integrating Content Software | Tika in Action Chapter 9 |
Wk10 | Survey of Open Source Content Detection Technologies (Textract; Scrapy; Droids; Mahout) and other Parser libraries | Tika in Action Chapter 10 |
Wk11 | Advanced Media Type Detection Algorithms (Bayesian; Byte Histograms) | Tika in Action Chapter 11 |
Wk12 | Searching Scientific Datasets | Tika in Action Chapters 12 and 14 |
Wk13 | Content Management Systems | Tika in Action Chapter 13 |
Wk14 | Public Datasets for Content Extraction (Public Terabyte Dataset; Amazon AWS public data; DARPA Memex and DARPA XDATA) | Tika in Action Chapter 15 |
Wk15 | Named Entity Recognition; Begin Review for Final Exam | Discussion of Apache UIMA, cTAKES and OpenNLP; Stanford NLTK and other toolkits |
Wk16 | Summary and Review | | Project 2 Due; Project 3 Assigned
Final Exam | | | Project 3 Due