Preprocessing and Topic Modeling

meow::06
Kat Hagedorn
David Newman
Clustering, Classification, and Metadata Enhancement Techniques on OAI Records
July 24, 2006
Bill Landis, ex officio
I. Preprocessing and Topic Modeling
II. The “Browser”
III. Lessons Learned and Next Steps
Goals
• Evaluate topical/subject-based metadata enhancement
• Experiment on testbed of multiple OAI repositories
• Discuss lessons learned and refine testing
• Propose products and services
Preprocessing & Topic Modeling >
What We Did
[Diagram] Cluster: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics
[Diagram] Classify: OAI records → preprocess (with vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
Clustering is learning the topics; classification is using the learned topics.
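The distinction between learning topics and using learned topics can be sketched in code. The slides do not name the topic model algorithm, so this sketch assumes LDA fitted by collapsed Gibbs sampling, a common choice; the function names, the hyperparameters alpha and beta, and the toy word ids are all illustrative assumptions, not the authors' implementation.

```python
import random


def learn_topics(docs, vocab_size, num_topics, iters=50,
                 alpha=0.1, beta=0.01, seed=0):
    """Cluster/learn step: collapsed Gibbs sampling for LDA (assumed model).

    docs is a list of records, each a list of integer word ids.
    Returns (topic_word, topic_total) count tables describing the topics.
    """
    rng = random.Random(seed)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            topic_word[t][w] += 1
            topic_total[t] += 1
            doc_topic[d][t] += 1
        assign.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic.
                t = assign[d][i]
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                doc_topic[d][t] -= 1
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = t
                topic_word[t][w] += 1
                topic_total[t] += 1
                doc_topic[d][t] += 1
    return topic_word, topic_total


def classify(doc, topic_word, topic_total, iters=40,
             alpha=0.1, beta=0.01, seed=0):
    """Classify step: infer topic proportions for one record.

    The learned topic counts are held fixed; only the record's own
    topic assignments are resampled.
    """
    rng = random.Random(seed)
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    counts = [0] * num_topics
    z = []
    for _ in doc:
        t = rng.randrange(num_topics)
        z.append(t)
        counts[t] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1
            weights = [
                (counts[k] + alpha) * (topic_word[k][w] + beta)
                / (topic_total[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            z[i] = rng.choices(range(num_topics), weights=weights)[0]
            counts[z[i]] += 1
    total = len(doc) + num_topics * alpha
    return [(counts[k] + alpha) / total for k in range(num_topics)]


# Toy data: four tiny records over a 4-word vocabulary (hypothetical).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1], [3, 2, 2]]
topic_word, topic_total = learn_topics(docs, vocab_size=4, num_topics=2, iters=30)
proportions = classify([0, 1, 0], topic_word, topic_total)
```

Because classification never updates the topic tables, it needs far fewer iterations than learning and can be run independently per record or per repository.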
Repository Selection
• Mix of cultural heritage repositories?
– UMich, Library of Congress, CDL, State Lib of Victoria (Aust), …
– Average of 15 words per record (excl. stopwords)
– Topics often specific to collection (e.g., State Lib of Victoria)
– Experience with CDL’s American West project
• Mix of scientific/research repositories?
– CiteSeer, arXiv, PubMed, …
– <description> is a reasonably reliable 200-word abstract
– Average of 75 words per record
– Topics more likely to span repositories
• For purposes of evaluation, used (mostly) English-language repositories
Selected Repositories*
Short name  Description                                                  Records  Records used for clustering (learning)
arxiv       arXiv.org Eprint Archive                                     368,000  1 in 3
caltech     Caltech Electronic Theses and Dissertations                  3,000    -
cern        CERN Document Server                                         45,000   1 in 2
citeseer    CiteSeer Scientific Literature Digital Library               717,000  1 in 3
doaj        Directory of Open Access Journals Articles                   29,000   1 in 2
iop         Institute of Physics                                         212,000  1 in 3
loc         Library of Congress Digitized Historical Collections         239,000  -
nsdl        The National Science Digital Library                         33,000   1 in 2
osti        Office of Science and Technology Information                 131,000  1 in 3
pangaea     Publishing Network for Geoscientific and Environmental Data  370,000  -
pubmed      PubMed Central                                               625,000  1 in 3
repec       Research Papers in Economics                                 141,000  1 in 3
*Repositories harvested by UMich/OAIster, June 7, 2006.
Usage of Dublin Core Fields
• Decided to use words from <title>, <description>, <subject> for clustering
• Idiosyncrasies
– CiteSeer: repeats <author> and <title> in <subject>
– CiteSeer: puts citations to other IDs in <description>
– arXiv: puts, e.g., “Comment: 12 pages PostScript” in <description>
– RePEc: no <subject>, repeats ID in <description>
– etc.
• Approach: Process all repositories identically, no special treatment
Preprocessing Example
Record before preprocessing:
<ID=oai:CiteSeerPSU:44072>
<title>Reinforcement Learning: A Survey
<description>This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …
<subject>Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey

Record after preprocessing (with vocabulary):
<ID=oai:CiteSeerPSU:44072>
reinforcement learning survey
survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement …
leslie pack kaelbling littman andrew moore reinforcement learning survey
Stopwords and Stemming
• Standard: and, the, …
• Research related: research, paper, data, system, method, result, …
• Repository specific: cern, citeseer, repec, Smith, …
• All tokens starting with a digit: 1996, 401k, …
• Produced stopword list of 500 words
• Applied very simple stemming (cars → car)
• Note: replacing collocations improves interpretability of topics, but not quality (los angeles → los_angeles)
• Don’t need to find and exclude all stopwords because topic model will help find these (e.g., des, les, une, …); suppress after the fact
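The rules on this slide can be sketched as a small preprocessing function. This is a minimal illustration, not the authors' preprocessor: the stopword set is a tiny stand-in for the 500-word list, and the tokenizer and the plural-stripping rule in simple_stem are assumptions.

```python
import re

# Tiny illustrative stopword set; the slide describes a list of 500 words.
STOPWORDS = {
    "and", "the", "of", "a", "is", "this", "from",           # standard
    "research", "paper", "data", "system", "method", "result",  # research related
    "cern", "citeseer", "repec",                              # repository specific
}


def simple_stem(token):
    """Very simple stemming (assumed rule): strip one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, tokenize, drop digit-leading tokens and stopwords, then stem."""
    kept = []
    for token in re.findall(r"[a-z0-9_]+", text.lower()):
        if token[0].isdigit():
            continue  # e.g., 1996, 401k
        stem = simple_stem(token)
        if token in stopwords or stem in stopwords:
            continue
        kept.append(stem)
    return kept


title_tokens = preprocess("Reinforcement Learning: A Survey")
```

For the title in the preprocessing example, this yields the tokens reinforcement, learning, survey.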
Building Vocabulary
• Preprocessed (sampled) repositories, excluded stopwords
• Only kept words that occurred in more than 10 records
• Result: a final vocabulary with ~90,000 words
• Most frequent words: cell, high, energy, protein, function, algorithm, field, theory, physics, …
• Resulting discussion point: When do we need to re-create the vocabulary? (When classifying, new documents will be filtered through existing vocabulary)
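A minimal sketch of the vocabulary step, assuming records have already been preprocessed into token lists. The function names and the toy records are hypothetical; only the "more than 10 records" rule and the idea of filtering new documents through the existing vocabulary come from the slide.

```python
from collections import Counter


def build_vocabulary(records, min_records=10):
    """Keep words that occur in more than min_records records."""
    record_freq = Counter()
    for tokens in records:
        record_freq.update(set(tokens))  # count each word once per record
    return {word for word, n in record_freq.items() if n > min_records}


def filter_through_vocabulary(tokens, vocabulary):
    """At classification time, drop tokens the vocabulary does not contain."""
    return [t for t in tokens if t in vocabulary]


# Toy data: "cell" and "protein" appear in 11 records, "rare" in only one.
records = [["cell", "protein"]] * 11 + [["rare"]]
vocabulary = build_vocabulary(records)
new_record = filter_through_vocabulary(["cell", "rare", "unseen"], vocabulary)
```

Words that first appear in newly harvested records are silently dropped by the filter, which is why the slide raises the question of when to re-create the vocabulary.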
• Average of 75 words per record
• Record lengths are bimodal because both records with abstracts and records without abstracts were used
• Topic model isn’t adversely affected by very short records
Computation
• Clustering (Learning)
– D = 750,000 records
– W = 90,000 word vocabulary
– L = 75 words per record
– T = 500 topics (Decision point: How many topics?)
– iter = 500 iterations (Decision point: How many iterations?)
– memory = 3DL + T(D+W) = 3 GByte
– time = D L T iter = 3 days (3 GHz Xeon)
• Classification
– D = 3,000,000 records total
– iter = 40 iterations
– max memory = 2 GByte
– max time = 5 hours (but repositories can run in parallel)
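The clustering figures can be checked with a few lines of arithmetic. The slide gives the memory expression as a count of stored entries; the 4 bytes per entry below is an assumption (it is not stated on the slide), and the update rate is simply what the quoted 3 days would imply.

```python
# Values from the slide.
D = 750_000   # records
W = 90_000    # vocabulary size
L = 75        # words per record
T = 500       # topics
ITER = 500    # iterations

# memory = 3DL + T(D+W), counted in stored entries.
entries = 3 * D * L + T * (D + W)

# Assumption: each entry is a 4-byte integer.
BYTES_PER_ENTRY = 4
memory_gb = entries * BYTES_PER_ENTRY / 1e9

# time = D L T iter, counted in elementary updates.
updates = D * L * T * ITER

# Implied throughput if the run takes the quoted 3 days.
seconds_in_3_days = 3 * 24 * 3600
updates_per_second = updates / seconds_in_3_days

print(f"{entries:,} entries, about {memory_gb:.1f} GB at 4 bytes each")
print(f"{updates:.3e} updates, about {updates_per_second:.2e} per second over 3 days")
```

Under the 4-byte assumption the entry count comes to roughly 2.4 GB, the same order as the 3 GByte quoted; the slide does not say what accounts for the remainder.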
Broad Topical Categories
• 500 topics too many to look at
• Need to organize topics under broad topical categories
– Cluster the clusters (automatic)
– Use pre-defined categories
  • Classify group of keywords (manual + automatic)
  • Create hierarchy by hand (manual)
[Diagram] Cluster the clusters: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics → topic model (cluster/learn) → broad topical categories
[Diagram] Classify group of keywords: group of keywords → preprocess (with vocabulary) → topic model (classify) → topics organized under broad topical categories
The Browser >
The “Browser”
• PHP/MySQL browser of 3 million OAI records*
• Preserving transparency for this audience
• Browser not meant for end users
• No search, no information architecture, etc.
• http://yarra.calit2.uci.edu/meow/
*Based on 750,000 sampled records from 9 repositories, 500 topics
The “Browser”: http://yarra.calit2.uci.edu/meow/
Selected Topics: Useful
• [ t201 ] learning machine training learn algorithm task examples reinforcement inductive learned learner supervised unsupervised
• [ t482 ] labor worker employment wage market labour job unemployment wages earning panel find evidence individual participation
• [ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
• [ t097 ] dark matter universe astrophysic cosmological cosmic background density inflation spectrum power scale cmb halo cosmology gravitational
• [ t027 ] hiv virus human immunodeficiency type envelope infection viral cd4 infected gag replication reverse aid tat gp120
• [ t365 ] waste radioactive wastes tank nuclear facilities management hanford disposal fuel storage material processing facility site level
> show all 500 sub-topics (to see all 500 topics)
Selected Topics: Less Useful
• [ t255 ] journal author chapter vol notes editor publication issue special bibliography reader references appendix literature submitted topic
• [ t328 ] paul mark thank andrew scott stephen alan steven miller george martin obituaries thesis daniel prof ian
• [ t384 ] supported part grant author foundation partially contract science national nsf support advanced ccr provided center agency
• [ t112 ] look people difficult thing need want fact reason help understand think say alway try easy bad
• [ t496 ] increase increased increases decrease increasing decreased decreases observed change decreasing significant caused decline
• [ t012 ] des les dan une est par sur pour qui nous sont aux ces analyse pay cette (and very useful to filter out French records)
But junk topics alleviate the need to exhaustively find stopwords; many useless words cluster as topics which can be suppressed
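Suppressing junk topics after the fact could look like the sketch below. The topic ids are the less useful topics listed on this slide; the per-record topic weights, the function names, and the 0.5 threshold for flagging French records are illustrative assumptions.

```python
# Topic ids taken from the "Less Useful" slide.
SUPPRESSED_TOPICS = {"t255", "t328", "t384", "t112", "t496", "t012"}
FRENCH_TOPIC = "t012"


def visible_topics(record_topics, suppressed=SUPPRESSED_TOPICS):
    """Drop suppressed topics from a record and renormalize the rest.

    record_topics maps topic id to the record's weight for that topic.
    """
    kept = {t: w for t, w in record_topics.items() if t not in suppressed}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {t: w / total for t, w in kept.items()}


def looks_french(record_topics, threshold=0.5):
    """Flag a record whose weight on the French-word topic is high.

    The threshold is an assumed value, not taken from the slides.
    """
    return record_topics.get(FRENCH_TOPIC, 0.0) >= threshold


shown = visible_topics({"t381": 0.6, "t255": 0.4})
flagged = looks_french({"t012": 0.8, "t201": 0.2})
```

Because suppression happens on the classification output, the stopword list does not have to be rebuilt and the topics do not have to be re-learned.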
Broad Topical Categories (BTCs)
• By clustering the clusters
– worked well
– mathematics, global energy resources, …
– can choose desired number of broad topical categories (e.g., 25) and thresholding
• By classifying groups of keywords
– worked well too
• Then review and manually edit
– include or exclude any subtopic
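One way to picture "clustering the clusters" down to a chosen number of categories is sketched below. In the deck's diagram the topic model itself is run a second time over the topics; this sketch swaps in a simpler greedy agglomerative merge on cosine similarity, and the topic vectors are made-up toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_topics(topic_vectors, num_categories):
    """Merge the two most similar groups until num_categories remain.

    topic_vectors maps topic id to a list of word weights.
    Returns a list of groups, each a sorted list of topic ids.
    """
    groups = [([tid], list(vec)) for tid, vec in topic_vectors.items()]
    while len(groups) > max(num_categories, 1):
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                sim = cosine(groups[i][1], groups[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        ids_i, vec_i = groups[i]
        ids_j, vec_j = groups[j]
        merged = (ids_i + ids_j, [a + b for a, b in zip(vec_i, vec_j)])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return [sorted(ids) for ids, _ in groups]


# Hypothetical 4-word topic vectors: two mathematics-like, two astronomy-like.
toy_topics = {
    "t381": [1.0, 0.9, 0.0, 0.0],
    "t191": [0.9, 1.0, 0.0, 0.0],
    "t097": [0.0, 0.0, 1.0, 0.8],
    "t352": [0.0, 0.0, 0.8, 1.0],
}
categories = cluster_topics(toy_topics, num_categories=2)
```

The resulting groups would still go through the manual review step, where any subtopic can be included or excluded.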
BTCs: Clustering the clusters
BTCs: Classifying group of keywords
>>> Aerospace Engineering
stars (15)
space (18)
aeronautics (20)
astronautics (20)
rocket (12)
shuttle (12)
exploration (15)
lander (3)
planets (7)
black holes (7)
quasars (7)
pulsars (7)
observatories (10)
air traffic (10)
aircraft (15)
aerospace (20)
airplanes (10)
airports (10)
heliports (10)
helicopters (10)
aviation (18)
FAA (7)
airlines (12)
flight (18)
comets (10)
meteorites (12)
spacecraft (15)
air force (7)
pilots (7)
jets (7)
air travel (15)
flying (18)
[Slide annotation: domain expert specifies list of relevant keywords and (importance)]
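A weighted keyword list like the one above can be matched against learned topics roughly as follows. In the deck the keyword group is preprocessed and run through the topic model classifier; this sketch uses a simpler stand-in that sums importance times P(word | topic), and the probabilities shown are invented for illustration.

```python
def score_topics(keywords, topic_word_probs):
    """Rank topics for a weighted keyword list.

    keywords maps a (preprocessed) keyword to its importance.
    topic_word_probs maps topic id to {word: P(word | topic)}.
    Returns [(topic id, share of total score)], highest share first.
    """
    raw = {}
    for topic, probs in topic_word_probs.items():
        raw[topic] = sum(weight * probs.get(word, 0.0)
                         for word, weight in keywords.items())
    total = sum(raw.values())
    if total == 0:
        return []
    ranked = sorted(((score / total, topic)
                     for topic, score in raw.items() if score > 0),
                    reverse=True)
    return [(topic, share) for share, topic in ranked]


# Keyword importances from the slide; word probabilities are hypothetical.
keywords = {"flight": 18, "aircraft": 15, "star": 15}
topic_word_probs = {
    "t192": {"flight": 0.02, "aircraft": 0.02},
    "t352": {"star": 0.03},
    "t482": {"labor": 0.05},
}
ranking = score_topics(keywords, topic_word_probs)
```

Topics with no matching keywords drop out, and the remaining shares can be shown as percentages for the reviewer to accept or delete.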
BTCs: Classifying group of keywords
>>> Aerospace Engineering
[t192] (69%) vehicle flight vehicles engine car road speed nasa aircraft air
[t352] (13%) star solar planet mass astrophysic binary dwarf orbital sun companion
[t191] (8%) space spaces hilbert subspace dimensional subspaces defined exploration linear point
>>> Dermatology
[Slide annotation: in review, would delete this topic from this BTC]
[t388] (83%) infection skin disease tract respiratory fever burgdorferi caused wound arthritis
[t157] (8%) cancer tumor p53 breast carcinoma survival human tumour malignant prostate
[t071] (7%) growth tuberculosis mycobacterium growing grow igf factor bcg avium
>>> Geology and Earth Sciences
[t121] (73%) geothermal rock seismic energy mountain drilling fluid survey spring yucca
[t268] (12%) sea atmospheric climate ice ocean atmosphere cloud global wind aerosol
>>> Molecular, Cellular and Developmental Biology
[t276] (31%) molecular biological sciences molecules biology molecule quantitative biochemistry basic
[t417] (15%) cell apoptosis cellular death cultured bcl lines hela transfected mediated
[t355] (12%) brain neuron neuronal cortex synaptic cortical rat nervous cerebral dopamine
[t418] (9%) genes genome gene repeat chromosome sequences dna genomic sequence region
[t319] (7%) mice development mouse drosophila expression transgenic cell embryonic embryos gene
>>> Transportation
[t192] (85%) vehicle flight vehicles engine car road speed nasa aircraft air
[Slide annotation: just found 1 topic relevant to transportation]
Browse Records in a Topic
[Screenshot annotations: can navigate back to multiple BTCs; nice mix of repositories]
Browse Records in a Topic: From one repository
[Screenshot annotation: display records just from Library of Congress]
Sample Record
Murphy's Law in algebraic geometry: Badly-behaved deformation spaces
> preprocessed text
murphy law algebraic geometry badly behaved deformation spaces
consider question bad deformation space object answer priori reason deformation space bad moduli spaces
precisely singularity finite type smooth parameter hilbert scheme curves projective space moduli spaces
smooth projective type surfaces higher dimensional varieties plane curves nodes cusp stable sheaves
isolated threefold singularities object pathological fact nice curves smooth surfaces ample canonical bundle
stable sheaves torsion free rank singularities normal cohen macaulay justifies mumford philosophy moduli
spaces behaved object arbitrarily bad priori reason construct smooth curve projective space deformation
space component singularity type reduced behavior subschemes similarly give surface f_p lift course hold
holomorphic category difficult compute deformation spaces directly obstruction theories circumvent relating
tractable deformation spaces smooth morphism essential starting point mnev universality theorem
mathematic algebraic geometry mathematic complex variables
> top topics (topics for this record)
[ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
[ t191 ] space spaces
oai:arXiv.org:math/0411469 (link to actual OAI record)
Repository-specific Browsers
• Library of Congress (http://yarra.calit2.uci.edu/oai/loc/)
• University of Michigan (http://yarra.calit2.uci.edu/oai/umich/)
• University of Washington (http://yarra.calit2.uci.edu/oai/uwash/)
• African Journals Online (http://yarra.calit2.uci.edu/oai/africa/)
• and many more…
Lessons Learned & Next Steps >
Evaluation
• Topic modeling worked well
– Most topics were useful
– Drain on computer resources was reasonable
– Human effort was relatively small
– All repositories processed identically, no special treatment
• Strategy worked well
– Clustering, then
– Classification, and
– Broad Topical Categories creation
Further Evaluation
• Current processing only for
– English-language repositories
– Science/research based repositories
• Need to test cultural heritage repositories and foreign-language records
– Less consistent descriptive language and length
– “On-the-horse” problem more prevalent
– Greater need to individually process repositories
• Also need usability testing to evaluate further
– Depends on criteria -- who are users?
• Librarians?
• End-users?
– Depends on products and services desired by users
Discussion Point: When to Re-cluster?
[Diagram] cluster → classify → classify → classify → cluster → classify → classify → classify → cluster
• Need to re-cluster
– when collection changes significantly
– if there is a “hole” in topics
– but NOT just because you have another repository
• If you re-cluster
– all topics will be different
– have to discard hand-labeling
– Broad Topical Categories might be different
• However, classification is
– “cheap” and easy
– e.g., for OAIster, could re-classify every harvest…until spring clean
Products and Services
• Depending on users…
• What kind of service is useful?
• What should interface to topics look/act like?
• What kind of use should we envision?
• We have some ideas…
Archive of Topics
• Are the topics we created useful to anyone else?
• Scenario: librarian uses topics/classifier for local resources
• To use locally you need:
– the preprocessor (i.e., the preprocessing rules)
– the vocabulary (file of 90,000 words)
– the topic model classifier
Subject Search/Browse for OAIster
• Integrate topics into OAIster
– add to records so can perform canned topic search
– add to interface so can browse BTCs to records
• Additionally, can allow users to find records similar to those retrieved
– e.g., retrieved records on cosmology and can find similar records on astrophysics, relativity, …
• How to do this?
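The slide leaves the question open. One possible approach, offered only as a sketch, is to compare records by the cosine similarity of their topic weights; the record ids and weights below are hypothetical, and nothing here is stated in the deck as the planned method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {topic id: weight} mappings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_records(query_id, record_topics, top_n=3):
    """Return ids of the records whose topic mix is closest to the query's."""
    query = record_topics[query_id]
    scored = [(cosine(query, topics), rid)
              for rid, topics in record_topics.items() if rid != query_id]
    scored.sort(reverse=True)
    return [rid for score, rid in scored[:top_n] if score > 0]


# Hypothetical records; topic ids reuse labels shown earlier in the deck.
record_topics = {
    "cosmology-1": {"t097": 0.9, "t352": 0.1},
    "astro-2": {"t352": 0.8, "t097": 0.2},
    "labor-3": {"t482": 1.0},
}
matches = similar_records("cosmology-1", record_topics)
```

With this toy data a cosmology record retrieves the astrophysics record, because they share topics, and skips the labor economics record.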
How To Reach Us
• David Newman: University of California, Irvine <newman@uci.edu>
• Kat Hagedorn: University of Michigan <khage@umich.edu>
• Bill Landis: California Digital Library <bill.landis@ucop.edu>