SlideSeer: A DL of aligned document and presentation pairs
Min-Yen Kan
WING (Web IR / NLP Group)
National University of Singapore
Min-Yen Kan, Digital Libraries
Web IR / NLP Group @ NUS, 20 June 2007 - JCDL: Session E

Scholarly Digital Libraries: what do we use them for?
• Find articles to print, read offline
• Browse, select research work
• Assess authors, publication venues, research groups
Papers (documents) don’t store all of the information about a discovery:
• Datasets
• Tools
• Implementation details / conditions
They also don’t help a person learn the research:
• Textbooks
• Slide presentations (we’ll focus on this)

Qualities of slide presentations
Good slide sets complement a document. They often:
• focus and highlight findings in the document
• create a bridge into the document itself
• are a visual and oral summary of a document
How can we leverage slides in a digital library?
What about poor slides?
“PowerPoint is presenter-oriented, not content-oriented or audience-oriented…”
The remedy? “Visual reasoning usually works more effectively when the relevant evidence is shown adjacent in space within the eyespan.” (Tufte, 2006)
[Slide image: “Four score and seven years ago”]

Documents and presentations as duals
Present identical or highly overlapping materials
• Document: for archival and reference purposes
• Presentation: for introducing and summarizing the work
As the two can be seen as duals, we should allow them to be viewed together.
– Would like random access of the presentation and document pair
Answer: find pairs of documents and presentations.

A model: MIT’s OpenCourseWare
A better answer: add fine-grained alignment.
[Screenshot labels: Audio of lecture, Slides in context, Simplified transcript of lecture]

Talk Outline
• Motivation
• Architecture
  1. Resource Discovery
  2. Alignment
  3. User Interface
• Demo
• Status and Conclusions
[Architecture diagram labels. Offline: Search Engine, 1. Resource Discovery, Converters (cz-ppt2txt, cz-ppt2gif, pdftohtml), convert, Data Store, Aligner (2. Alignment). Online: Web Server (sv, dv, pv, ssv, search), 3. User Interface (Javascript-enabled browser).]

1. Resource Discovery
Algorithm:
• Obtain suitable document metadata
• Web search to find candidate presentations
• Post-process to usable form

1. Resource Discovery – Obtaining Metadata
Start with CiteSeer (thanks to IST: CL Giles, I Councill)
• 750K records with parsed header metadata
• Complete with .pdf documents
Enhancement: Merge DBLP snapshot (Aug 2006; 1.2M docs) with CiteSeer
– Large-scale record linkage task, O(nm) complexity unacceptable
– Indexed DBLP into Lucene, use each CS record to retrieve DBLP variants, resulting in O(n) complexity
– Result size: 1.5M

1. Resource Discovery – Finding presentations
Google API on title, author to find corresponding presentation
• Use simple Jaccard similarity threshold to decide matches
– threshold λ3 for title+author similarity
[Diagram labels: CiteSeer + DBLP merge, DBLP Lucene Index, λ1, λ2; Web (filetype: ppt), Presentations, λ3]
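The Jaccard threshold test used to accept or reject a candidate presentation can be sketched as follows. This is a minimal illustration, not the SlideSeer implementation: the record layout (`title` and `authors` strings), the tokenizer, and the threshold value 0.5 standing in for λ3 are all assumptions, and no search API is called.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_match(record, candidate, threshold=0.5):
    """Accept a candidate presentation if title+author overlap is high enough.

    record, candidate: dicts with 'title' and 'authors' strings
    (hypothetical schema). threshold plays the role of lambda_3;
    0.5 is a placeholder, not the value used in the talk.
    """
    a = tokens(record["title"] + " " + record["authors"])
    b = tokens(candidate["title"] + " " + candidate["authors"])
    return jaccard(a, b) >= threshold


record = {
    "title": "SlideSeer: A DL of aligned document and presentation pairs",
    "authors": "Min-Yen Kan",
}
good_hit = {
    "title": "SlideSeer: a DL of aligned document and presentation pairs",
    "authors": "Kan, Min-Yen",
}
bad_hit = {"title": "Introduction to Information Retrieval", "authors": "Someone Else"}

print(is_match(record, good_hit), is_match(record, bad_hit))
```

Working on token sets makes the test insensitive to case, punctuation, and author name order, which is useful when metadata comes from web search snippets.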
Resource Discovery – Conversion
Via pdftohtml
– text
– formatted text
Via czppt2gif/convert
– png
– text
Final results: ~85% precision, recall difficult to calculate (~80%)
11K pairs after processing 200K of 1.5M records
Many caveats:
• only .pdf and .ppt formats currently handled
• conversion fails often, pdf conversion difficult
• current work: use OCR to redo text extraction

2. Alignment – Problem formulation
Q: What are we aligning?
A: Text of slides to document text
– Use paragraphs to delimit text units in documents
– Use document headers to delimit sections
Q: What type of alignment is necessary?
A: Depends. Presentation or document centered view?
– Presentation: 1 slide aligned to 0 or more paragraphs
– Document: 1 section aligned to 0 or more slides
Q: What’s the approach?
A: Two stages:
– Basic similarity measure to calculate a similarity matrix
– Alignment schemes to establish alignment mapping
(Concentrate on this)
[Figure: similarity matrix of text units (p) against slides (s)]

2. Alignment – Related Work
1. Narration to presentation alignment
– Usually naturally synchronous: monotonic alignment
2. Multilingual text alignment
– Used in Machine Translation (MT)
– Polynomial complexity (~O(n^3)) but heuristics tend to work well
3. Slide/abstract to document alignment
– Use Hidden Markov Model (HMM) for alignment
– Doesn’t handle missing materials well
Desiderata:
• Should take context into account
• But shouldn’t enforce monotonicity
• Nil (zero) alignments needed, when materials don’t overlap

2. Alignment – Similarity Measures
Take text units, cut into tokens. Then calculate similarity using:
1. Cosine: $\frac{\sum_i s_i p_i}{\sqrt{\sum_i s_i^2}\,\sqrt{\sum_i p_i^2}}$
– Standard IR metric
– TF×IDF for token weight
– Calculate slide, paragraph vector similarity using cosine
2. Jaccard: $\frac{|s \cap p|}{|s \cup p|}$
– unigram tokens
– bigram
– unigram + bigram
– Use IDF weighting for tokens
For both schemes, use IDF weighting from WebBase corpus

2. Alignment – Schemes
Using matrix of <p,s> similarity, align using:
1. Max Similarity
– Baseline
– Can’t do nil alignment
2. Edit Distance
– Efficient dynamic programming
– But outputs only monotonic alignments
3. Local Jump Model
– Variation on #2 to allow local backward jumps
– Backward jumps within 5% of text units
– Still doesn’t handle reordered sections
4. Hidden Markov Model
– Word-based
– Attempts to find origin of s in p
– Only handles overlapping information
[Figure: slides si-5 … si+5, words wj-5 … wj+5 of slide si, and paragraphs ranked p1 > p2 > p3 > p4 > p5 > p6]

2. Alignment – Span Extension
As Maximum Similarity does quite well, let’s extend the algorithm
Idea: post-process to extend from points to spans
• Retrieve the top n (n=10) most similar paragraphs
• Try all possible spans among them for alignment (on the order of n^2)
alignment_score(x, y) = span_sim × ln(span_length)
Slightly favor longer spans

2. Alignment – Alignment Correction
Neighboring alignments can help to correct a spurious one
[Figure: three panels (a), (b), (c), each showing slides si-1, si, si+1 against paragraphs p1 … pn]
• (a) monotonic alignment → ok
• (b) si jumps back from si-1, but then proceeds monotonically → probably ok, minor penalty
• (c) si jumps back, but si+1 jumps forward again → looks more like an error, major penalty applied
Final alignment score: alignment_score × (1 - penalty)
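Span extension and alignment correction can be sketched together as below. This is an illustration under stated assumptions, not the system's code: it uses plain unweighted Jaccard in place of the IDF-weighted measures, takes `span_sim` to be the similarity between the slide and the union of tokens in the span, and uses made-up penalty sizes (the slides only say "minor" and "major"). Under the literal formula a one-paragraph span scores zero because ln(1) = 0, so the sketch enumerates only spans of two or more paragraphs; how point alignments compete with spans is not specified in the slides.

```python
import math
from itertools import combinations

# Hypothetical penalty sizes; the slides only say "minor" and "major".
MINOR_PENALTY = 0.1
MAJOR_PENALTY = 0.5


def jaccard(a, b):
    """Unweighted Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_span(slide, paragraphs, n=10):
    """Return (score, (x, y)) for the best paragraph span, or None.

    Both endpoints x < y are drawn from the n paragraphs most similar
    to the slide; the span covers paragraphs x..y inclusive.
    """
    sims = [jaccard(slide, p) for p in paragraphs]
    top = sorted(range(len(paragraphs)), key=lambda i: sims[i], reverse=True)[:n]
    best = None
    for x, y in combinations(sorted(top), 2):
        span_tokens = set().union(*paragraphs[x:y + 1])
        span_sim = jaccard(slide, span_tokens)
        # alignment_score(x, y) = span_sim * ln(span_length), as on the slide.
        score = span_sim * math.log(y - x + 1)
        if best is None or score > best[0]:
            best = (score, (x, y))
    return best


def jump_penalty(prev_p, cur_p, next_p):
    """Penalty for slide s_i from the paragraph positions of s_(i-1), s_i, s_(i+1)."""
    if cur_p >= prev_p:
        return 0.0            # (a) no backward jump
    if next_p >= prev_p:
        return MAJOR_PENALTY  # (c) s_i jumps back, s_(i+1) jumps forward again
    return MINOR_PENALTY      # (b) s_i jumps back and s_(i+1) stays behind s_(i-1)


def corrected_score(score, penalty):
    """Final alignment score: alignment_score * (1 - penalty)."""
    return score * (1.0 - penalty)


slide = ["alignment", "similarity", "span", "extension"]
paragraphs = [
    ["intro", "digital", "libraries"],
    ["alignment", "similarity", "matrix"],
    ["span", "extension", "scores"],
    ["future", "work"],
]
score, span = best_span(slide, paragraphs, n=2)
print(span, corrected_score(score, jump_penalty(3, 1, 2)))
```

The three-way test in `jump_penalty` is one reading of cases (a) to (c); it does not cover large forward jumps, which the slide does not discuss.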
Alignment – Nil classifier
But not all text units should be aligned
Use machine learning (SVM) to learn a binary classifier
Features
1. Similarity score
2. Number of words on slide
– Few words can indicate figures, pictures with less preference for alignment
3. Words on slide
– Cue phrases: “outline”, “questions”, “thanks”
4. Alignment path
– Jumping alignments (e.g., outline slides)

2. Alignment – Evaluation Dataset
• Manually compiled alignment dataset by author and fellow researcher
• Gold standard: annotate all acceptable spans, or nil
• 20 presentation and document pairs from databases
– Dataset is freely downloadable

Statistic                                        Value
Average number of slides in presentation         37.6
Average number of paragraphs in document         277.3
Average number of nil (zero) alignments          6.6 (17.4%)
Average number of span alignments (s, x-y)       8.8 (23.4%)
Average number of point alignments (s, x)        22.2 (59.2%)
Total                                            37.6 (100%)

2. Alignment – Evaluation

Alignment Method                                               Weighted Jaccard Accuracy
1. Max Similarity (cosine)                                     33.4%
2. Edit Distance (cosine)                                      28.8%
3. Local Jump (cosine)                                         25.1%
4. Jing HMM                                                    28.8%
5. Max Sim + spanning (Jaccard bigram)                         39.9%
6. Max Sim + spanning + nil classification (Jaccard bigram)    41.2%

40%? Why is it so difficult?
• Noise in conversion process. Other studies have used clean data.
• Others have used soft accuracy (any overlap is correct)
Use Weighted Jaccard accuracy as metric
• Fractional accuracy for partially correct answers
• Give false positives (extra spurious alignments) less weight

3. User Interface – Rationale
How might fine-grained aligned pairs be utilized in a large DL?
Coordinated Views / Collection Interface
• Learning / Comprehension
• Comparing pairs
• Summarization
• Searching for suitable materials
• Offline Viewing

3. UI – Coordinated Views
[Figure labels: Gallery View, Slideshow View, Slide View, Print View, Document View, Full Document View; axis from slide centric to document centric]

SlideSeer Prototype Demo
Production environment differs from demo

3. UI – Collection Interface
• Searching
– Lucene indexing of the static print view
– Show title along with the set of results
• Spider-friendly
– Main content loaded dynamically by Javascript, not spiderable
– Currently use print view (as it is static) for spiderable interface
• URLs
– Most material in the form <subject/surname/year/title/view/type?offset>
– Implies hierarchy of papers
– Constructed URLs to promote browsing access
• Simple keyboard shortcuts
– For expert user navigation

Conclusion
• Alignment of documents to presentations
• Simple approach works well thus far
– Tweaks to get more mileage out of simple approach
– Span alignment, nil alignment modifications
– But certainly more models to try!
– 40% best performance, certainly much room to improve
Deployment status
– In Alpha (development)
– Beta hopefully in mid 2008
– Usability testing underway
Interested in digital anthologies?
• Join our mailing list (web: dAnth)
• Current: text extraction project for ACL Anthology

Other slides

Future Work
• Planning to hook up current work in progress
– 2 stage CRF/SVM re-ranking citation segmentation algorithm
– Automatic keyphrase extraction program
– Automatic synthetic image classification
– Automatic de-duplication module
• Partnering with Simone Teufel (Cambridge U.) to do argumentative zoning of documents
– What is a citation used for?

Poor slides
• Often represent a biased view of the full results
– Cherry picking evidence to support claims
– Imply that evidence is independent (when it is statistically correlated)
– May summarize other findings inaccurately (secondary or tertiary sources)