Text mining in digital collections

Introduction
Text Mining
Classification
Bridgeman Digital Art Library
Bridgeman Categories
Sample Classification Data
CHASE: Going digital
Deirdre Lungley
dmlung@essex.ac.uk
February 6, 2013
Bridgeman Categories
2 : Oriental Miniatures
7 : Maps
9 : Posters
12 : Arms, Armour & Militaria
15 : Botanical
18 : Clocks, Watches, Barometers & Sundials
20 : Costume & Fashion
21 : Enamels
22 : Ephemera
24 : Furniture
25 : Glass
27 : Icons
29 : Inventions
30 : Jewellery (see also Semi-precious stones)
31 : Juvenilia / Children's Toys & Games
33 : Lighting
35 : Medicine
38 : Mythology Mythological Myth
40 : Animals
41 : Mosaics
44 : Semi-precious Stones (see also Jewellery)
46 : Science
47 : Sculpture
51 : Sports and Leisure
56 : Trade Emblems, City Crests, Coats of Arms
1126 : CHOIR BOOKS
5000 : The Arts and Entertainment
5001 : Ancient and World Cultures
5002 : Architecture
5003 : Business and Industry
5004 : Places
5005 : Science and Medicine
5006 : History
5007 : Religion and Belief
5010 : Travel and Transport
5011 : Plants and Animals
5013 : Emotions and Ideas
Sample Classification Data
Query / Clicked URL
monster woman / Dulle Griet raiding Hell
nuno / The Fishermen from the Polyptych of St. Vincent
girl poor / A Peasant Girl Gathering Faggots in a Wood
Gold Standard Annotations
5007 : Religion and Belief
5 : Allegory / Allegorical
38 : Mythology Mythological Myth
5007 : Religion and Belief
42 : Personalities
5009 : People and Society
Classifier Predictions
5007 : Religion and Belief
5007 : Religion and Belief
5012 : Land and Sea
42 : Personalities
5009 : People and Society
5012 : Land and Sea
Python & NLTK
Web Services
Sample Code (1) – Wikify text
Tools of the trade
Python:
High-level language
Many standard libraries, e.g., XML parser
Natural Language Toolkit (NLTK):
A platform for building Python programs to work with human language data (nltk.org)
Why?
Glue between applications
Data preparation for tools such as Weka
Allows programmatic access to web services
Example Web Service – WikipediaMiner
Sample Python XML parsing – Wikify RSS title
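The code shown on this slide is not preserved in the text. As a rough sketch of the idea only: parse an RSS document with Python's standard XML parser, pull out each item title, and build the request that would send that title to a wikify-style web service. The endpoint address and the `source` parameter name below are placeholders chosen for illustration, not the WikipediaMiner API, and the sample feed is made up.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Tiny inline RSS document standing in for a downloaded feed (illustrative data).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Bruegel exhibition opens in Antwerp</title></item>
    <item><title>New maps of the Thames estuary</title></item>
  </channel>
</rss>"""

# Placeholder endpoint and parameter name: NOT the real WikipediaMiner API.
WIKIFY_ENDPOINT = "http://example.org/services/wikify"


def item_titles(rss_text):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="") for item in root.iter("item")]


def wikify_url(text, endpoint=WIKIFY_ENDPOINT):
    """Build a GET URL that would send `text` to a wikify-style service."""
    return endpoint + "?" + urllib.parse.urlencode({"source": text})


titles = item_titles(SAMPLE_RSS)
urls = [wikify_url(t) for t in titles]
```

Only the URL is built here; actually calling a service would be one more step with `urllib.request.urlopen`, followed by parsing whatever the service returns.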
Sample Python XML parsing – Wikify RSS title (Output)
Supervised Learning - Basics
Sample Code (2) – Classify BBC RSS Feeds
Supervised Learning - Basics
Classifier (Model) built from:
Positive/Negative examples (labelled data)
Features - present/absent for a given label
Test data built using:
Present/absent classifier features
Case Study - Support Vector Machine (SVM) Classifier:
Locates the maximum-margin separating hyperplane - the training points closest to it are the support vectors
Used extensively in research
Here – treat as black box – default settings
SVMLight data format:
<target> <feature>:<value> ... <feature>:<value>
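To make the present/absent features and the SVMLight line format concrete, here is a minimal sketch. It assumes each example has already been reduced to a list of tokens; the function names and the two toy examples are illustrative, not taken from the slides. Each distinct token gets a numeric feature id, a present feature is written as `id:1`, and absent features are left out. SVMLight expects feature ids in increasing order on each line, so they are sorted.

```python
def build_vocabulary(token_lists):
    """Map each distinct token to a feature id, starting at 1."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_svmlight_line(target, tokens, vocab):
    """Encode one example as '<target> <feature>:<value> ...'.

    Features are binary (1 = present); absent features are omitted.
    Tokens that are not in the vocabulary are skipped, which is what
    happens to unseen words in test data.
    """
    ids = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([str(target)] + [f"{i}:1" for i in ids])


# Toy labelled examples: +1 = positive, -1 = negative (illustrative data).
training = [
    (+1, ["paint", "galleri", "artist"]),
    (-1, ["match", "goal", "artist"]),
]
vocab = build_vocabulary(tokens for _, tokens in training)
lines = [to_svmlight_line(label, toks, vocab) for label, toks in training]
```

The test data is encoded with the same `vocab`, so that a given feature id means the same word in both files.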
[Figure: supervised learning pipeline. Training examples → feature extractor → pos/neg labelled feature sets → learning tool → classifier model. Test examples → feature extractor → test feature sets → classifier model → predictions.]
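[Figure: the same pipeline for the case study. Training examples come from the Project Gutenberg Catalogue and test examples from the BBC RSS feed. The feature extractor produces the training data (pos/neg labelled feature sets) and the test data (test feature sets). SVM_Learn is the learning tool that builds the classifier model; SVM_Classify applies the model to the test data to give predictions.]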
Training Data – Project Gutenberg
Case Study Task: Classify BBC RSS feeds
Retrieve & parse BBC RSS feed
Create classification features: casefolding, tokenisation, stemming, stopword removal
Classify (test data → predictions): output to file on disk, call command, read file
Retrieve & parse RSS feed
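The slide's code is not preserved in the text. A minimal sketch of this step, using only the standard library, might look like the following. The feed address is an assumption (the URL used in the talk is not recorded here), and the sample feed is hand-written so the parsing can be tried without a network connection.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed feed address; the URL used on the original slide is not preserved.
BBC_FEED_URL = "http://feeds.bbci.co.uk/news/rss.xml"


def fetch_feed(url=BBC_FEED_URL, timeout=10):
    """Download the raw RSS XML as bytes (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_feed(xml_bytes):
    """Return a list of {'title', 'description'} dicts, one per <item>."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


# Offline demonstration on a small hand-written feed (illustrative data).
SAMPLE = b"""<rss version="2.0"><channel>
<item><title>Storm closes coastal railway</title>
<description>Services are suspended after flooding.</description></item>
<item><title>Museum acquires rare mosaic</title></item>
</channel></rss>"""

entries = parse_feed(SAMPLE)
```

With a connection available, `parse_feed(fetch_feed())` would return the live items in the same shape.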
Retrieve & parse RSS feed (Output)
Text to Features
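The slide's code is not preserved in the text. The sketch below walks through the four feature-creation steps named in the case study (casefolding, tokenisation, stemming, stopwords) in that order. To stay self-contained it uses two stand-ins that are assumptions of this example: a very short stopword list and a crude suffix stripper. NLTK provides a fuller stopword list and a Porter stemmer that would normally replace them.

```python
import re

# Small illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

# Crude suffix stripping as a stand-in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_to_features(text):
    """Turn raw text into a list of feature tokens."""
    lowered = text.lower()                         # casefolding
    tokens = re.findall(r"[a-z0-9]+", lowered)     # tokenisation
    stems = [naive_stem(t) for t in tokens]        # stemming
    return [s for s in stems if s not in STOPWORDS]  # stopword removal


features = text_to_features("The painters are painting in the galleries")
```

The resulting tokens are what gets mapped to numeric feature ids when the SVMLight file is written.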
Text to Features (Output)
Classify: Test data → predictions (Output)
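The code that produced this output is not preserved in the text. The classify step was described as three actions: output to file on disk, call command, read file. A hedged sketch of those actions follows, assuming the SVMlight `svm_classify` binary is on the PATH and that a model file has already been produced by `svm_learn`. All file names are hypothetical. SVMlight writes one decision value per test example, and the sign of that value gives the predicted class.

```python
import subprocess
from pathlib import Path


def write_examples(lines, path):
    """Step 1: write SVMLight-format lines to a file on disk."""
    Path(path).write_text("\n".join(lines) + "\n")


def classify_command(test_path, model_path, output_path, binary="svm_classify"):
    """Step 2: argument list for `svm_classify example_file model_file output_file`."""
    return [binary, str(test_path), str(model_path), str(output_path)]


def read_predictions(path):
    """Step 3: read one decision value per line and attach its sign as the class."""
    values = [float(v) for v in Path(path).read_text().split()]
    return [(+1 if v > 0 else -1, v) for v in values]


def classify(lines, test_path, model_path, output_path):
    """Run all three steps; requires the svm_classify binary and a trained model."""
    write_examples(lines, test_path)
    subprocess.run(classify_command(test_path, model_path, output_path), check=True)
    return read_predictions(output_path)
```

A call such as `classify(test_lines, "test.dat", "model", "predictions.txt")` would then return one `(class, decision_value)` pair per feed item, treating the SVM as a black box with default settings.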
Create training data (Output)
References:
The Regex Coach
Thank You!