meow::06
Clustering, Classification, and Metadata Enhancement Techniques on OAI Records
Kat Hagedorn, David Newman
Bill Landis, ex officio
July 24, 2006

I. Preprocessing and Topic Modeling
II. The “Browser”
III. Lessons Learned and Next Steps

Goals
• Evaluate topical/subject-based metadata enhancement
• Experiment on testbed of multiple OAI repositories
• Discuss lessons learned and refine testing
• Propose products and services

Preprocessing & Topic Modeling > What We Did
• Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
• Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
• Clustering is learning the topics
• Classification is using the learned topics

Preprocessing & Topic Modeling > Repository Selection
• Mix of cultural heritage repositories?
– UMich, Library of Congress, CDL, State Lib of Victoria (Aust), …
– Average of 15 words per record (excl.
stopwords)
– Topics often specific to collection (e.g., State Lib of Victoria)
– Experience with CDL’s American West project
• Mix of scientific/research repositories?
– CiteSeer, arXiv, PubMed, …
– <description> is a reasonably reliable 200-word abstract
– Average of 75 words per record
– Topics more likely to span repositories
• For purposes of evaluation, used (mostly) English-language repositories

Preprocessing & Topic Modeling > Selected Repositories*
Short Name | Description | Records | Records used for clustering (learning)
arxiv | arXiv.org Eprint Archive | 368,000 | 1 in 3
caltech | Caltech Electronic Theses and Dissertations | 3,000 | -
cern | CERN Document Server | 45,000 | 1 in 2
citeseer | CiteSeer Scientific Literature Digital Library | 717,000 | 1 in 3
doaj | Directory of Open Access Journals Articles | 29,000 | 1 in 2
iop | Institute of Physics | 212,000 | 1 in 3
loc | Library of Congress Digitized Historical Collections | 239,000 | -
nsdl | The National Science Digital Library | 33,000 | 1 in 2
osti | Office of Scientific and Technical Information | 131,000 | 1 in 3
pangaea | Publishing Network for Geoscientific and Environmental Data | 370,000 | -
pubmed | PubMed Central | 625,000 | 1 in 3
repec | Research Papers in Economics | 141,000 | 1 in 3
*Repositories harvested by UMich/OAIster, June 7, 2006.

Preprocessing & Topic Modeling > Usage of Dublin Core Fields
• Decided to use words from <title>, <description>, <subject> for clustering
• Idiosyncrasies
– CiteSeer: repeats <author> and <title> in <subject>
– CiteSeer: puts citations to other IDs in <description>
– arXiv: puts e.g., “Comment: 12 pages PostScript” in <description>
– RePEc: no <subject>, repeats ID in <description>
– etc.
• Approach: Process all repositories identically, no special treatment

Preprocessing & Topic Modeling > Preprocessing Example
After preprocessing (filtered through the vocabulary):
<ID=oai:CiteSeerPSU:44072>
reinforcement learning survey
survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement …
leslie pack kaelbling littman andrew moore reinforcement learning survey
Original record:
<ID=oai:CiteSeerPSU:44072>
<title>Reinforcement Learning: A Survey
<description>This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …
<subject>Leslie Pack Kaelbling, Michael Littman, Andrew Moore.
Reinforcement Learning: A Survey

Preprocessing & Topic Modeling > Stopwords and Stemming
• Standard: and, the, …
• Research related: research, paper, data, system, method, result, …
• Repository specific: cern, citeseer, repec, Smith, …
• All tokens starting with a digit: 1996, 401k, …
• Produced stopword list of 500 words
• Applied very simple stemming (cars → car)
• Note: replacing collocations improves interpretability of topics, but not quality (los angeles → los_angeles)
• Don’t need to find and exclude all stopwords because topic model will help find these (e.g. des, les, une, …); suppress after the fact

Preprocessing & Topic Modeling > Building Vocabulary
• Preprocessed (sampled) repositories, excluded stopwords
• Only kept words that occurred in more than 10 records
• Result: a final vocabulary with ~90,000 words
• Most frequent words: cell, high, energy, protein, function, algorithm, field, theory, physics, …
• Resulting discussion point: When do we need to re-create the vocabulary? (When classifying, new documents will be filtered through existing vocabulary)

Preprocessing & Topic Modeling > [slide title and figure not preserved]
• Average of 75 words per record
• Bimodal because used records with abstracts and records without abstracts
• Topic model isn’t adversely affected by very short records

Preprocessing & Topic Modeling > Computation
• Clustering (Learning)
– D = 750,000 records
– W = 90,000 word vocabulary
– Decision point: How many topics?
– Decision point: How many iterations?
– L = 75 words per record
– T = 500 topics
– iter = 500 iterations
– memory = 3DL + T(D+W) ≈ 3 GByte
– time ∝ D × L × T × iter ≈ 3 days (3 GHz Xeon)
• Classification
– D = 3,000,000 records total
– iter = 40 iterations
– max memory = 2 GByte
– max time = 5 hours (but repositories can run in parallel)

Preprocessing & Topic Modeling > Broad Topical Categories
• 500 topics too many to look at
• Need to organize topics under broad topical categories
– Cluster the clusters (automatic)
– Use pre-defined categories
  • Classify group of keywords (manual + automatic)
  • Create hierarchy by hand (manual)

Preprocessing & Topic Modeling > Broad Topical Categories [diagram]
• Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
• Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories
• Classify group of keywords: group of keywords → preprocess (vocabulary) → topic model (classify) → topics organized under broad topical categories

I. Preprocessing and Topic Modeling
II. The “Browser”
III. Lessons Learned and Next Steps

The Browser > The “Browser”
• PHP/MySQL browser of 3 million OAI records*
• Preserving transparency for this audience
• Browser not meant for end users
• No search, no information architecture, etc.
• http://yarra.calit2.uci.edu/meow/
*Based on 750,000 sampled records from 9 repositories, 500 topics

The Browser > The “Browser”: http://yarra.calit2.uci.edu/meow/

The Browser > Selected Topics: Useful
• [ t201 ] learning machine training learn algorithm task examples reinforcement inductive learned learner supervised unsupervised
• [ t482 ] labor worker employment wage market labour job unemployment wages earning panel find evidence individual participation
• [ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
• [ t097 ] dark matter universe astrophysic cosmological cosmic background density inflation spectrum power scale cmb halo cosmology gravitational
• [ t027 ] hiv virus human immunodeficiency type envelope infection viral cd4 infected gag replication reverse aid tat gp120
• [ t365 ] waste radioactive wastes tank nuclear facilities management hanford disposal fuel storage material processing facility site level
> show all 500 sub-topics (to see all 500 topics)

The Browser > Selected Topics: Less Useful
• [ t255 ] journal author chapter vol notes editor publication issue special bibliography reader references appendix literature submitted topic
• [ t328 ] paul mark thank andrew scott stephen alan steven miller george martin obituaries thesis daniel prof ian
• [ t384 ] supported part grant author foundation partially contract science national nsf support advanced ccr provided center agency
• [ t112 ] look people difficult thing need want fact reason help understand think say alway try easy bad
• [ t496 ] increase increased increases decrease increasing decreased decreases observed change decreasing significant caused decline
• [ t012 ]
des les dan une est par sur pour qui nous sont aux ces analyse pay cette
But junk topics alleviate the need to exhaustively find stopwords; many useless words cluster as topics, which can be suppressed, and they are very useful for filtering out French records

The Browser > Broad Topical Categories (BTCs)
• By clustering the clusters
– worked well
– mathematics, global energy resources, …
– can choose desired number of broad topical categories (e.g., 25) and thresholding
• By classifying groups of keywords
– worked well too
• Then review and manually edit
– include or exclude any subtopic

The Browser > BTCs: Clustering the clusters

The Browser > BTCs: Classifying group of keywords
Callout: domain expert specifies list of relevant keywords and (importance)
>>> Aerospace Engineering
stars (15), space (18), aeronautics (20), astronautics (20), rocket (12), shuttle (12), exploration (15), lander (3), planets (7), black holes (7), quasars (7), pulsars (7), observatories (10), air traffic (10), aircraft (15), aerospace (20), airplanes (10), airports (10), heliports (10), helicopters (10), aviation (18), FAA (7), airlines (12), flight (18), comets (10), meteorites (12), spacecraft (15), air force (7), pilots (7), jets (7), air travel (15), flying (18)

The Browser > BTCs: Classifying group of keywords
Callout: in review, would delete this topic from this BTC
>>> Aerospace Engineering
[t192] (69%) vehicle flight vehicles engine car road speed nasa aircraft air
[t352] (13%) star solar planet mass astrophysic binary dwarf orbital sun companion
[t191] (8%) space spaces hilbert subspace dimensional subspaces defined exploration linear point
>>> Dermatology
[t388] (83%) infection skin disease tract respiratory fever
burgdorferi caused wound arthritis
[t157] (8%) cancer tumor p53 breast carcinoma survival human tumour malignant prostate
[t071] (7%) growth tuberculosis mycobacterium growing grow igf factor bcg avium
>>> Geology and Earth Sciences
[t121] (73%) geothermal rock seismic energy mountain drilling fluid survey spring yucca
[t268] (12%) sea atmospheric climate ice ocean atmosphere cloud global wind aerosol
>>> Molecular, Cellular and Developmental Biology
[t276] (31%) molecular biological sciences molecules biology molecule quantitative biochemistry basic
[t417] (15%) cell apoptosis cellular death cultured bcl lines hela transfected mediated
[t355] (12%) brain neuron neuronal cortex synaptic cortical rat nervous cerebral dopamine
[t418] (9%) genes genome gene repeat chromosome sequences dna genomic sequence region
[t319] (7%) mice development mouse drosophila expression transgenic cell embryonic embryos gene
>>> Transportation
[t192] (85%) vehicle flight vehicles engine car road speed nasa aircraft air
Callout: just found 1 topic relevant to transportation

The Browser > Browse Records in a Topic
Callouts: can navigate back to multiple BTCs; nice mix of repositories

The Browser > Browse Records in a Topic: From one repository
Callout: display records just from Library of Congress

The Browser > Sample Record
Murphy's Law in algebraic geometry: Badly-behaved deformation spaces
> preprocessed text
murphy law algebraic geometry badly behaved deformation spaces consider question bad deformation space object answer priori reason deformation space bad moduli spaces precisely singularity finite type smooth parameter hilbert scheme curves projective space moduli spaces smooth projective type surfaces higher dimensional varieties plane curves nodes cusp stable sheaves
isolated threefold singularities object pathological fact nice curves smooth surfaces ample canonical bundle stable sheaves torsion free rank singularities normal cohen macaulay justifies mumford philosophy moduli spaces behaved object arbitrarily bad priori reason construct smooth curve projective space deformation space component singularity type reduced behavior subschemes similarly give surface f_p lift course hold holomorphic category difficult compute deformation spaces directly obstruction theories circumvent relating tractable deformation spaces smooth morphism essential starting point mnev universality theorem mathematic algebraic geometry mathematic complex variables
> top topics (callout: topics for this record)
[ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
[ t191 ] space spaces
oai:arXiv.org:math/0411469 (callout: link to actual OAI record)

The Browser > Repository-specific Browsers
• Library of Congress (http://yarra.calit2.uci.edu/oai/loc/)
• University of Michigan (http://yarra.calit2.uci.edu/oai/umich/)
• University of Washington (http://yarra.calit2.uci.edu/oai/uwash/)
• African Journals Online (http://yarra.calit2.uci.edu/oai/africa/)
• and many more…

Outline (I, II, III):
Preprocessing and Topic Modeling; The “Browser”; Lessons Learned and Next Steps

Lessons Learned & Next Steps > Evaluation
• Topic modeling worked well
– Most topics were useful
– Drain on computer resources was reasonable
– Human effort was relatively small
– All repositories processed identically, no special treatment
• Strategy worked well
– Clustering, then
– Classification, and
– Broad Topical Categories creation

Lessons Learned & Next Steps > Further Evaluation
• Current processing only for
– English-language repositories
– Science/research based repositories
• Need to test cultural heritage repositories and foreign-language records
– Less consistent descriptive language and length
– “On-the-horse” problem more prevalent
– Greater need to individually process repositories
• Also need usability testing to evaluate further
– Depends on criteria: who are users?
  • Librarians?
  • End-users?
– Depends on products and services desired by users

Lessons Learned & Next Steps > Discussion Point: When to Re-cluster?
[timeline] cluster → classify → classify → classify → cluster → classify → classify → classify → cluster
• Need to re-cluster
– when collection changes significantly
– if there is a “hole” in topics
– but NOT just because you have another repository
• If you re-cluster
– all topics will be different
– have to discard hand-labeling
– Broad Topical Categories might be different
• However, classification is
– “cheap” and easy
– e.g., for OAIster, could re-classify every harvest…until spring clean

Lessons Learned & Next Steps > Products and Services
• Depending on users…
• What kind of service is useful?
• What should interface to topics look/act like?
• What kind of use should we envision?
• We have some ideas…

Lessons Learned & Next Steps > Archive of Topics
• Are the topics we created useful to anyone else?
• Scenario: librarian uses topics/classifier for local resources
• To use locally you need:
– the preprocessor (i.e. the preprocessing rules)
– the vocabulary (file of 90,000 words)
– the topic model classifier

Lessons Learned & Next Steps > Subject Search/Browse for OAIster
• Integrate topics into OAIster
– add to records so can perform canned topic search
– add to interface so can browse BTCs to records
• Additionally, can allow users to find records similar to those retrieved
– e.g., retrieved records on cosmology and can find similar records on astrophysics, relativity, …
• How to do this?

How To Reach Us
• David Newman: University of California, Irvine <newman@uci.edu>
• Kat Hagedorn: University of Michigan <khage@umich.edu>
• Bill Landis: California Digital Library <bill.landis@ucop.edu>
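
Sketch: a local preprocessor and vocabulary filter
The Archive of Topics slide lists a preprocessor and the vocabulary file as prerequisites for local use, and the Stopwords and Stemming and Building Vocabulary slides describe the rules. The Python below is a minimal sketch of those rules, not the presenters' code. The function names, the tokenizing regex, the plural-stripping stemmer, and the short stopword set are all assumptions made for illustration; the real stopword list had about 500 words and is not reproduced in the slides.

```python
import re
from collections import Counter

# Illustrative stopwords only (assumption): a few standard, research-related,
# and repository-specific words taken from the examples on the slide.
STOPWORDS = {
    "a", "and", "the", "of", "to", "is", "it", "this", "in", "from",
    "research", "paper", "data", "system", "method", "result",
    "cern", "citeseer", "repec", "smith",
}


def simple_stem(token):
    """Hypothetical 'very simple' stemmer: strip one plural 's' (cars -> car)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "is", "us")):
        return token[:-1]
    return token


def preprocess(text, vocabulary=None):
    """Lowercase, tokenize, drop digit-initial tokens and stopwords, stem.

    If a vocabulary is given, tokens outside it are dropped, mirroring the
    note that new documents are filtered through the existing vocabulary.
    """
    kept = []
    for raw in re.findall(r"[a-z0-9_]+", text.lower()):
        if raw[0].isdigit():          # e.g. 1996, 401k
            continue
        stem = simple_stem(raw)
        if raw in STOPWORDS or stem in STOPWORDS:
            continue
        if vocabulary is not None and stem not in vocabulary:
            continue
        kept.append(stem)
    return kept


def build_vocabulary(records, min_records=10):
    """Keep only words that occur in more than `min_records` records."""
    doc_freq = Counter()
    for text in records:
        doc_freq.update(set(preprocess(text)))  # count each word once per record
    return {word for word, n in doc_freq.items() if n > min_records}


if __name__ == "__main__":
    print(preprocess("Reinforcement Learning: A Survey"))
```

Run on the title from the Preprocessing Example slide, this sketch yields the same three tokens shown there (reinforcement, learning, survey). It will not reproduce every stem in the slides, since the presenters' exact stemming rules are not given.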