Natural Language Processing for LODLAM
A brief intro to machine learning & data science for Libraries
Presented at IGeLU 2014 by Corey A Harper
2014-09-16

Context
Narrative
Storytelling
The Library's story, and the Archives' story, but also…

Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014

Users' stories
Scholars' stories
Adding context through recombinant metadata

Scholars' & Users' Stories – Tim Sherratt (@wragge)
Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/

Library Authority Data
“Include links to other URIs, so that they can discover more things.”
Short of providing and linking to URIs, this *is* authority data.
This is what our authority files are for.

Linked data is about context
authorities provide context
and yet our controlled vocabs are nearly gone
because the interfaces to them were broken

The Death of Browse
• Next-Gen Discovery Systems don't make use of Authority Control
• “Browse” was/is broken as a UI Design
• Rich data in Authorities, disconnected from narrative, context, search
• Richer “Authority” type data outside libraries...
• “Next Gen Next Gen Discovery…”

FuzzyWuzzy – SeatGeek
FuzzyWuzzy – Awesome Library from SeatGeek
https://github.com/seatgeek/fuzzywuzzy
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python

Slide courtesy of Doug Oard, Univ.
of Maryland

Tools - Natural Language Processing
• DBpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
• Zemanta: http://www.zemanta.com/?wpst=1
• Open Calais: http://www.opencalais.com/
• Open Refine: http://openrefine.org/
• DataTXT: https://dandelion.eu/products/datatxt/
• AlchemyAPI: http://www.alchemyapi.com/
• FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy

Where does this lead?
We need
new interfaces
new tools
for a new kind of cataloger
for knowledge organization experts

Linked Jazz Back End

Primo PNX and Authorities
• Indexing Cross References
• New Browse Functionality
• Authority Control from Aleph / Alma
• What about non-MARC, or non-Aleph Data?
• Matching Strings to Authorities

Enter Open Refine
http://freeyourmetadata.org/

Match strings to vocabularies…

Like LCNAF…

Or Wikipedia

Automated Authority Control?
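
The idea of matching strings to authority headings can be illustrated with a minimal sketch. It is not the approach shown on the slides: it uses Python's standard-library difflib as a stand-in for FuzzyWuzzy, and the heading list, the normalize, similarity and best_match helpers, and the threshold of 80 are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical sample of authorized headings; a real workflow would load
# these from an authority file such as LCNAF.
AUTHORITY_HEADINGS = [
    "Armstrong, Louis, 1901-1971",
    "Ellington, Duke, 1899-1974",
    "Fitzgerald, Ella",
    "Davis, Miles",
]


def normalize(text):
    """Lowercase, replace non-letters with spaces, and sort the tokens."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a, b):
    """Return a 0-100 similarity score between two normalized strings."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


def best_match(name, headings=AUTHORITY_HEADINGS, threshold=80):
    """Return (heading, score) for the closest heading.

    If the best score falls below the (assumed) threshold, the heading
    is None so that a person can review the string instead.
    """
    score, heading = max((similarity(name, h), h) for h in headings)
    return (heading, score) if score >= threshold else (None, score)


if __name__ == "__main__":
    for candidate in ["Louis Armstrong", "Miles Davies", "Corey Harper"]:
        print(candidate, "->", best_match(candidate))
```

Sorting the tokens before comparison makes "Louis Armstrong" and "Armstrong, Louis" look alike, which matters because authority headings are usually inverted. Returning None below the threshold leaves uncertain matches for human review rather than applying them automatically.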

Open Refine RDF Skeleton

Proposed System Architecture

Hydra Modeling & Architecture
• Approaches to Provenance
  • Prov-O
  • Named Graphs
  • Named Datastreams
• “n” nyucore “records”
  • Same properties defined for each
  • Keep data sources separate
  • Merge for display in Blacklight & export to Primo

Separate Metadata Datastreams
• source_metadata, enrich_metadata
  • Reload one or both without affecting other or native metadata
• native_metadata
  • Edited only through Hydra UI
  • Partitioned from external sources

Metadata Provenance

Fedora Datastreams

Blacklight User Interface

Where does this lead?
We need
new interfaces
new tools
for a new kind of cataloger
for knowledge organization experts

A Role for Ex Libris
• Alma &/or Primo
  • Named Entity Recognition
  • Vocabulary Reconciliation
  • Provenance Management
• Primo Central
  • Named Entity Recognition on Full Text
  • Auto Classification

A bit louder...
we need new interfaces
we need enterprise tools
Integrated into our metadata management systems
for a new kind of cataloger
for knowledge organization experts

Simplified Workflow Proposal

More Tools – At Programming Level
• Open NLP: https://opennlp.apache.org/
• Stanford NLP Group Software: http://nlp.stanford.edu/software/index.shtml
• Python Tools
  • scikit-learn, pandas, NLTK, SciPy, NumPy
  • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
  • http://pandas.pydata.org/
  • http://www.nltk.org/

More Data Science-ey Tools
http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html

Data Science Techniques
• Feature Extraction / Feature Engineering
• Predictive Modeling
• Probabilistic Classification – Large Multi-Class Problems
• Text Analytics
• Vectorization
• Bags & Sets of Words
• TF/IDF
• N-Grams
• Sparse Matrices

Simple Example – Predict Yelp Star Ratings

Fitting a Model – Naïve Bayes

Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

\[ \mathrm{IDF}(\text{Term}) = 1 + \ln\left(\frac{\text{Total Document Count}}{\text{Documents Containing Term}}\right) \]
http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323

Where can we go from here?
• NER is just the beginning
• Feature Engineering
• Hiring Statisticians
• Clustering & Classification
• Vocabulary Pruning and Engineering
• Manageable 10-20k Class Text Classification Problems
• Domain Specific
• Ex Libris’ Activity in this space

Thanks!
corey.harper@nyu.edu
212.998.2479
@chrpr
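
Supplementary sketch for the "Data Science Techniques" slide: bags of words and TF/IDF weighting can be illustrated in a few lines of standard-library Python. This is a minimal illustration, not the code behind the Yelp example. It assumes the inverse document frequency form 1 + ln(total document count / documents containing the term); the toy corpus and the bag_of_words, idf and tfidf helpers are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for review text.
DOCS = [
    "great food great service",
    "terrible service",
    "food was fine",
]


def bag_of_words(text):
    """Count whitespace-separated, lowercased tokens in one document."""
    return Counter(text.lower().split())


def idf(term, docs):
    """Inverse document frequency: 1 + ln(total docs / docs containing term)."""
    containing = sum(1 for doc in docs if term in bag_of_words(doc))
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(docs) / containing)


def tfidf(doc, docs):
    """Weight each term count in doc by the term's IDF across docs."""
    return {term: count * idf(term, docs)
            for term, count in bag_of_words(doc).items()}


if __name__ == "__main__":
    for doc in DOCS:
        print(doc, "->", tfidf(doc, DOCS))
```

Terms that occur in few documents ("terrible") receive a higher weight than terms spread across the corpus ("service"), which is what makes the weighted counts useful as features for a classifier such as Naïve Bayes.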