Text mining in digital collections
CHASE: Going digital
Deirdre Lungley
dmlung@essex.ac.uk
February 6, 2013


Introduction

Bridgeman Digital Art Library
[slide content not preserved]

Bridgeman Categories

  ID    Category
  2     Oriental Miniatures
  7     Maps
  9     Posters
  12    Arms, Armour & Militaria
  15    Botanical
  18    Clocks, Watches, Barometers & Sundials
  20    Costume & Fashion
  21    Enamels
  22    Ephemera
  24    Furniture
  25    Glass
  27    Icons
  29    Inventions
  30    Jewellery (see also Semi-precious stones)
  31    Juvenilia / Children's Toys & Games
  33    Lighting
  35    Medicine
  38    Mythology Mythological Myth
  40    Animals
  41    Mosaics
  44    Semi-precious Stones (see also Jewellery)
  46    Science
  47    Sculpture
  51    Sports and Leisure
  56    Trade Emblems, City Crests, Coats of Arms
  1126  CHOIR BOOKS
  5000  The Arts and Entertainment
  5001  Ancient and World Cultures
  5002  Architecture
  5003  Business and Industry
  5004  Places
  5005  Science and Medicine
  5006  History
  5007  Religion and Belief
  5010  Travel and Transport
  5011  Plants and Animals
  5013  Emotions and Ideas

Sample Classification Data

The slide is a three-column table; the row alignment of the annotation columns is not preserved, so each column is listed in its original order.

Query/Clicked URL:
  monster woman / Dulle Griet raiding Hell
  nuno / The Fishermen from the Polyptych of St. Vincent
  girl poor / A Peasant Girl Gathering Faggots in a Wood

Gold Standard Annotations:
  5007 : Religion and Belief
  5 : Allegory / Allegorical
  38 : Mythology Mythological Myth
  5007 : Religion and Belief
  42 : Personalities
  5009 : People and Society

Classifier Predictions:
  5007 : Religion and Belief
  5007 : Religion and Belief
  5012 : Land and Sea
  42 : Personalities
  5009 : People and Society
  5012 : Land and Sea


Text Mining

Tools of the trade

Python:
  - High level language
  - Many standard libraries, e.g., XML parser
Natural Language Toolkit (NLTK):
  - A platform for building Python programs to work with human language data (nltk.org)
Why?
  - Glue between applications
  - Data preparation for tools such as Weka
  - Allows programmatic access to web services

Example Web Service – WikipediaMiner
[slide content not preserved]

Sample Code (1) – Wikify text

Sample Python XML parsing – Wikify RSS title
[slide content not preserved]

Sample Python XML parsing – Wikify RSS title (Output)
[slide content not preserved]


Classification

Supervised Learning - Basics

Classifier (Model) built from:
  - Positive/Negative examples (labelled data)
  - Features - present/absent for a given label
Test data built using:
  - Present/absent classifier features
Case Study - Support Vector Machine (SVM) Classifier:
  - Locates marginal points on hyperplane - support vectors
  - Used extensively in research
  - Here – treat as black box – default settings
SVMLight data format:
  <target> <feature>:<value> ... <feature>:<value>

Supervised classification workflow (diagram):
  Training: Training Examples → Feature Extractor → Pos/Neg labelled feature sets → Learning tool → Classifier model
  Testing:  Test Examples → Feature Extractor → Test feature sets → Classifier model → Predictions

The same workflow for the case study (diagram):
  Training: Project Gutenberg Catalogue (Training Examples) → Feature Extractor → Pos/Neg labelled feature sets (Training Data) → SVM_Learn (Learning tool) → Classifier model
  Testing:  BBC RSS Feed (Test Examples) → Feature Extractor → Test feature sets (Test Data) → SVM_Classify (Classifier model) → Predictions

Training Data – Project Gutenberg
[slide content not preserved]

Sample Code (2) – Classify BBC RSS Feeds

Case Study Task: Classify BBC RSS feeds
  - Retrieve & parse BBC RSS feed
  - Create Classification Features
      - Casefolding
      - Tokenisation
      - Stemming
      - Stopwords
  - Classify (test data → predictions)
      - Output to file on disk
      - Call command
      - Read file

Retrieve & parse RSS feed
[slide content not preserved]

Retrieve & parse RSS feed (Output)
[slide content not preserved]

Text to Features
[slide content not preserved]

Text to Features (Output)
[slide content not preserved]

Classify: Test data → predictions (Output)
[slide content not preserved]

Training Data – Project Gutenberg
[slide content not preserved]

Create training data (Output)
[slide content not preserved]

References: The Regex Coach

Thank You!
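
The first two case study steps (retrieve and parse an RSS feed, then create classification features by casefolding, tokenisation, stopword removal and stemming) and the SVMLight data format can be sketched roughly as follows. This is a minimal illustration and not the code shown in the talk. It uses only the Python standard library, so the inline `SAMPLE_RSS` document, the short `STOPWORDS` set and the `crude_stem` suffix stripper are all assumptions standing in for a live BBC feed, a full stopword list and a real stemmer such as NLTK's `PorterStemmer`. The default target of 0 is likewise an assumed placeholder for a test example whose label is unknown.

```python
import re
import xml.etree.ElementTree as ET

# Tiny inline RSS document standing in for a downloaded feed (illustrative data).
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Museums are digitising their painting collections</title></item>
<item><title>The painted maps of the museum</title></item>
</channel></rss>"""

# Deliberately small, illustrative stopword list.
STOPWORDS = {"a", "an", "and", "are", "in", "of", "the", "their", "to"}


def rss_titles(rss_text):
    """Return the title of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="") for item in root.iter("item")]


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_to_terms(text):
    """Casefold, tokenise, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.casefold())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def svmlight_line(terms, vocabulary, target=0):
    """Format one example as '<target> <feature>:<value> ...'.

    Features are binary (present/absent). Feature ids are assigned the
    first time a term is seen, start at 1, and are written in increasing order.
    """
    ids = sorted({vocabulary.setdefault(t, len(vocabulary) + 1) for t in terms})
    return " ".join([str(target)] + [f"{i}:1" for i in ids])


if __name__ == "__main__":
    vocabulary = {}
    for title in rss_titles(SAMPLE_RSS):
        print(svmlight_line(text_to_terms(title), vocabulary))
```

The shared `vocabulary` dictionary matters: training and test data must map each term to the same feature id, so in practice the mapping built while creating the training data would be saved and reused when building the test data.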
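
The classify step of the case study (output to file on disk, call command, read file) might look something like the sketch below. It assumes the SVMLight command-line convention of `svm_classify <test file> <model file> <output file>`, with one numeric score written per test example and the sign of the score giving the predicted class. The function name `classify`, the temporary file names and the `command` parameter are hypothetical choices made for this illustration; the `command` parameter exists only so that a different executable can be substituted.

```python
import os
import subprocess
import tempfile


def classify(svmlight_lines, model_path, command=("svm_classify",)):
    """Write test data to disk, call the classifier command, read its output.

    Assumes the command is invoked as: <test file> <model file> <output file>
    and writes one numeric score per test example, one per line.
    """
    with tempfile.TemporaryDirectory() as workdir:
        test_path = os.path.join(workdir, "test.dat")
        out_path = os.path.join(workdir, "predictions.txt")

        # 1. Output to file on disk
        with open(test_path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(svmlight_lines) + "\n")

        # 2. Call command (raises CalledProcessError on a non-zero exit code)
        subprocess.run([*command, test_path, model_path, out_path], check=True)

        # 3. Read file
        with open(out_path, encoding="utf-8") as handle:
            scores = [float(line) for line in handle if line.strip()]

    # Pair each score with the class implied by its sign.
    return [(score, 1 if score > 0 else -1) for score in scores]
```

A call such as `classify(["0 1:1 3:1 5:1"], "model.dat")` would then return a list of `(score, label)` pairs, provided `svm_classify` is on the path and `model.dat` was produced beforehand by `svm_learn`.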