SlideSeer: A DL of aligned document and presentation pairs
Min-Yen Kan
WING (Web IR / NLP Group)
National University of Singapore
20 June 2007 - JCDL: Session E
Scholarly Digital Libraries: what do we use them for?
• Find articles to print, read offline
• Browse, select research work
• Assess authors, publication venues, research groups

Papers (documents) don’t store all of the information about a discovery:
• Datasets
• Tools
• Implementation details / conditions

They also don’t help a person learn the research:
• Textbooks
• Slide presentations ← we’ll focus on this
Qualities of slide presentations
Good slide sets complement a document. They often:
• focus and highlight findings in the document
• create a bridge into the document itself
• are a visual and oral summary of a document
How can we leverage slides in a digital library?
What about poor slides?
“PowerPoint is presenter-oriented, not content-oriented or audience-oriented…”
The remedy? “Visual reasoning usually works more effectively when the relevant evidence is shown adjacent in space within the eyespan.” (Tufte, 2006)
[Figure: “Four score and seven years ago…” rendered as PowerPoint slides]
Documents and presentations as duals
Both present identical or highly overlapping materials:
• Document: for archival and reference purposes
• Presentation: for introducing and summarizing the work
As the two can be seen as duals, we should allow them to be viewed together.
– Would like random access into the presentation and document pair
Answer: find pairs of documents and presentations.
A model: MIT’s OpenCourseWare
A better answer: add fine-grained alignment.
[Screenshot of an MIT OpenCourseWare lecture page showing: audio of lecture, slides in context, simplified transcript of lecture]
Talk Outline
• Motivation
• Architecture
  1. Resource Discovery
  2. Alignment
  3. User Interface
• Demo
• Status and Conclusions

[Architecture diagram. Offline: resource discovery (1) feeds converters (pdftohtml, cz-ppt2txt, cz-ppt2gif) into the data store, which the aligner (2) processes. Online: a web server exposes search (via a search engine) and the sv / dv / pv / ssv views to a Javascript-enabled browser (3, user interface).]
1. Resource Discovery
Algorithm:
• Obtain suitable document metadata
• Web search to find candidate presentations
• Post-process to a usable form
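As a rough illustration of this three-step loop, here is a minimal Python sketch; discover_pairs, search_web, and postprocess are hypothetical names, and the record fields are assumptions rather than SlideSeer’s actual code.

def discover_pairs(metadata_records, search_web, postprocess):
    """Toy version of the discovery loop: metadata in, usable document/slide pairs out."""
    pairs = []
    for rec in metadata_records:                           # 1. suitable document metadata
        hits = search_web(rec["title"], rec["authors"])    # 2. web search for candidate presentations
        for hit in hits:
            pair = postprocess(rec, hit)                   # 3. post-process to a usable form
            if pair is not None:
                pairs.append(pair)
    return pairs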
1. Resource Discovery – Obtaining Metadata
Start with CiteSeer (thanks to IST: CL Giles, I Councill)
• 750K records with parsed header metadata
• Complete with .pdf documents
Enhancement: Merge a DBLP snapshot (Aug 2006; 1.2M docs) with CiteSeer
– Large-scale record linkage task; O(nm) complexity unacceptable
– Indexed DBLP into Lucene, use each CiteSeer record to retrieve DBLP variants, resulting in O(n) complexity (sketch below)
– Result size: 1.5M
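A minimal sketch of the blocking idea behind this merge, with a toy in-memory token index standing in for the Lucene index; the record fields and the min_shared cutoff are assumptions for illustration.

from collections import defaultdict

def tokenize(title):
    return {t for t in title.lower().split() if len(t) > 2}

def build_title_index(dblp_records):
    """Map title token -> DBLP record ids (a stand-in for the Lucene index)."""
    index = defaultdict(set)
    for rid, rec in enumerate(dblp_records):
        for tok in tokenize(rec["title"]):
            index[tok].add(rid)
    return index

def candidate_matches(citeseer_record, index, min_shared=3):
    """Retrieve DBLP variants sharing several title tokens with one CiteSeer record,
    so each record probes the index instead of being compared against all of DBLP."""
    counts = defaultdict(int)
    for tok in tokenize(citeseer_record["title"]):
        for rid in index.get(tok, ()):
            counts[rid] += 1
    return [rid for rid, c in counts.items() if c >= min_shared]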
1. Resource Discovery – Finding presentations
Google API query on title and author to find the corresponding presentation
• Use a simple Jaccard similarity threshold to decide matches (sketch below)
– threshold λ3 for title+author similarity
[Diagram: CiteSeer records are matched against a DBLP Lucene index (thresholds λ1, λ2) and merged into a combined CiteSeer + DBLP set; the merged records drive a web search restricted to filetype:ppt, and candidates passing threshold λ3 become the presentation set]
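A minimal sketch of the title+author match test; the λ3 value and the field names are placeholders, not the thresholds actually used.

def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def is_matching_presentation(record, hit_text, lambda3=0.5):
    """Accept a web hit as the paper's presentation if title+author overlap passes lambda3."""
    query = (record["title"] + " " + record["authors"]).lower().split()
    return jaccard(query, hit_text.lower().split()) >= lambda3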
1. Resource Discovery – Conversion
Via pdftohtml (sketch below):
- text
- formatted text
Via cz-ppt2gif / convert:
- png
- text
Final results:
~85% precision, recall difficult to calculate (~80%)
11K pairs after processing 200K of 1.5M records
Many caveats:
• only .pdf and .ppt formats currently handled
• conversion often fails; PDF conversion is difficult
• current work: use OCR to redo text extraction
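A hedged sketch of the PDF side of the conversion step, shelling out to pdftohtml; the -xml flag and the output-naming behaviour shown here may differ across pdftohtml versions, so treat the invocation as illustrative.

import subprocess
from pathlib import Path

def convert_pdf(pdf_path, out_dir):
    """Dump a paper PDF to XML so paragraphs and headers can be parsed downstream."""
    stem = Path(out_dir) / Path(pdf_path).stem
    # -xml asks pdftohtml for per-line text elements with layout coordinates;
    # pdftohtml appends the .xml extension to the output name itself.
    subprocess.run(["pdftohtml", "-xml", pdf_path, str(stem)], check=True)
    return stem.with_suffix(".xml")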
2. Alignment – Problem formulation
Q: What are we aligning?
A: Text of slides to document text
– Use paragraphs to delimit text units in documents
– Use document headers to delimit sections
Q: What type of alignment is necessary?
A: Depends. Presentation-centered or document-centered view?
– Presentation: 1 slide aligned to 0 or more paragraphs
– Document: 1 section aligned to 0 or more slides
[Diagram: slides (s) × document text units (p) → similarity matrix]
Q: What’s the approach?
A: Two stages:
– a basic similarity measure to calculate a similarity matrix (sketch below)
– alignment schemes to establish the alignment mapping ← concentrate on this
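The first stage reduces to filling a |slides| × |paragraphs| matrix; a minimal sketch follows, with sim standing in for whichever similarity measure (cosine, Jaccard, …) is plugged in.

def similarity_matrix(slides, paragraphs, sim):
    """matrix[i][j] = similarity between slide i and paragraph j."""
    return [[sim(s, p) for p in paragraphs] for s in slides]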
2. Alignment – Related Work
1. Narration to presentation alignment
– Usually naturally synchronous: Monotonic alignment
2. Multilingual text alignment
– Used in Machine Translation (MT)
– Polynomial complexity (~O(n³)), but heuristics tend to work well
3. Slide/abstract to document alignment
– Use Hidden Markov Model (HMM) for alignment
– Doesn’t handle missing materials well.
Desiderata:
• Should take context into account
• But shouldn’t enforce monotonicity
• Nil (zero) alignments are needed when materials don’t overlap
2. Alignment – Similarity Measures
Take text units, cut into tokens. Then calculate similarity using:
1. Cosine
– Standard IR metric
– TF×IDF for token weight
– Calculate slide-paragraph vector similarity using the cosine:
  cos(s, p) = Σ_i s_i p_i / ( √(Σ_i s_i²) · √(Σ_i p_i²) )
2. Jaccard
– unigram tokens
– bigram tokens
– unigram + bigram tokens
– Use IDF weighting for tokens:
  Jaccard(s, p) = |s ∩ p| / |s ∪ p|
For both schemes, use IDF weighting from the WebBase corpus (sketch below)
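Minimal sketches of the two measures over token lists; the idf dictionary is assumed to hold precomputed IDF weights (from WebBase in the talk), and bigram tokens would be formed before calling these.

import math
from collections import Counter

def cosine_sim(slide_tokens, para_tokens, idf):
    """TF×IDF-weighted cosine between a slide and a paragraph."""
    s = {t: c * idf.get(t, 1.0) for t, c in Counter(slide_tokens).items()}
    p = {t: c * idf.get(t, 1.0) for t, c in Counter(para_tokens).items()}
    dot = sum(s[t] * p[t] for t in s.keys() & p.keys())
    norm = math.sqrt(sum(v * v for v in s.values())) * math.sqrt(sum(v * v for v in p.values()))
    return dot / norm if norm else 0.0

def weighted_jaccard_sim(slide_tokens, para_tokens, idf):
    """IDF-weighted Jaccard: weight of the shared tokens over weight of the union."""
    s, p = set(slide_tokens), set(para_tokens)
    inter = sum(idf.get(t, 1.0) for t in s & p)
    union = sum(idf.get(t, 1.0) for t in s | p)
    return inter / union if union else 0.0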
2. Alignment – Schemes
Using the matrix of <p, s> similarities, align using:
1. Max Similarity
– Baseline
– Can’t do nil alignment
2. Edit Distance
– Efficient dynamic programming (sketch below)
– But outputs only monotonic alignments
3. Local Jump Model
– Variation on #2 to allow local backward jumps
– Backward jumps within 5% of text units
– Still doesn’t handle reordered sections
4. Hidden Markov Model
– Word-based
– Attempts to find the origin of s in p
– Only handles overlapping information
[Diagram: HMM word-emission view, where each word wj of slide si is generated from one of the document paragraphs p1 … p6]
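For scheme 2, here is a simplified monotonic dynamic program in the spirit of edit-distance alignment: each slide takes one paragraph, paragraph indices never decrease, and the total similarity is maximized. The exact scoring and handling of insertions/deletions in SlideSeer may differ; this is only a sketch.

def monotonic_align(sim):
    """sim[i][j]: similarity of slide i to paragraph j.
    Returns one non-decreasing paragraph index per slide."""
    n, m = len(sim), len(sim[0])
    f = [[0.0] * m for _ in range(n)]      # f[i][j]: best score with slide i on paragraph j
    back = [[0] * m for _ in range(n)]
    f[0] = list(sim[0])
    for i in range(1, n):
        best_k, best_val = 0, f[i - 1][0]
        for j in range(m):
            if f[i - 1][j] > best_val:     # best predecessor among paragraphs <= j
                best_k, best_val = j, f[i - 1][j]
            f[i][j] = sim[i][j] + best_val
            back[i][j] = best_k
    # backtrack from the best final cell
    j = max(range(m), key=lambda c: f[n - 1][c])
    alignment = [0] * n
    for i in range(n - 1, -1, -1):
        alignment[i] = j
        if i > 0:
            j = back[i][j]
    return alignment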
2. Alignment – Span Extension
As Maximum Similarity does quite well, let’s extend the algorithm
Idea: post-process to extend from points to spans (sketch below)
• Retrieve the top n (n = 10) most similar paragraphs
• Try all O(n²) possible spans for alignment
alignment_score(x, y) = span_sim × ln(span_length)
– slightly favors longer spans
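A toy version of the span-extension post-process for one slide; span_sim is taken here as the mean similarity over the span, which is an assumption, and the +1 inside the log only keeps single-paragraph spans from scoring zero in this sketch.

import math

def best_span(sim_row, n=10):
    """sim_row[j]: similarity of one slide to paragraph j. Returns the best (x, y) span."""
    top = sorted(range(len(sim_row)), key=lambda j: sim_row[j], reverse=True)[:n]
    best, best_score = None, float("-inf")
    for x in top:
        for y in top:
            if x > y:
                continue
            span = sim_row[x:y + 1]
            span_sim = sum(span) / len(span)
            score = span_sim * math.log(len(span) + 1)   # talk's formula uses ln(span_length)
            if score > best_score:
                best, best_score = (x, y), score
    return best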
2. Alignment – Alignment Correction
Neighboring alignments can help to correct a spurious one
[Diagram: three alignment patterns (a), (b), (c) for consecutive slides si-1, si, si+1 against paragraphs p1 … pn]
• (a) monotonic alignment → ok
• (b) si jumps back from si-1, but then proceeds monotonically
  → probably ok, minor penalty
• (c) si jumps back, but si+1 jumps forward again
  → looks more like an error, major penalty applied
Final alignment score: alignment_score × (1 - penalty)  (sketch below)
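One plausible formalization of the three cases, with illustrative penalty values; the slides do not give the actual tests or numbers, so everything below is an assumption.

def correction_penalty(prev_p, cur_p, next_p, minor=0.1, major=0.5):
    """Penalty for slide i given the paragraphs aligned to slides i-1, i, i+1."""
    if prev_p <= cur_p:                 # (a) monotonic: no penalty
        return 0.0
    if cur_p <= next_p < prev_p:        # (b) jumped back, then proceeds monotonically in the new region
        return minor
    return major                        # (c) the backward jump is isolated: likely spurious

def corrected_score(alignment_score, prev_p, cur_p, next_p):
    return alignment_score * (1 - correction_penalty(prev_p, cur_p, next_p))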
2. Alignment – Nil classifier
But not all text units should be aligned
Use machine learning (SVM) to learn a binary classifier
Features (see the sketch below):
1. Similarity score
2. Number of words on the slide
– few words can indicate figures or pictures, with less preference for alignment
3. Words on the slide
– cue phrases: “outline”, “questions”, “thanks”
4. Alignment path
– jumping alignments (e.g., outline slides)
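A hedged scikit-learn sketch of the nil classifier; the talk names an SVM and these four feature types, but the concrete feature encodings, kernel, and training data below are assumptions.

from sklearn.svm import SVC
import numpy as np

CUE_PHRASES = ("outline", "questions", "thanks")

def nil_features(slide_text, best_sim, jumped_backwards):
    words = slide_text.lower().split()
    return [
        best_sim,                                            # 1. similarity of best-aligned paragraph
        len(words),                                          # 2. number of words on the slide
        sum(w.strip("?!.") in CUE_PHRASES for w in words),   # 3. cue-phrase count
        1.0 if jumped_backwards else 0.0,                    # 4. alignment-path feature
    ]

# X: one feature row per slide; y: 1 if the slide should receive a nil alignment
X = np.array([nil_features("Outline of the talk", 0.05, True),
              nil_features("Results on the evaluation set", 0.62, False)])
y = np.array([1, 0])
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X))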
2. Alignment – Evaluation Dataset
• Alignment dataset manually compiled by the author and a fellow researcher
• Gold standard: annotate all acceptable spans, or nil
• 20 presentation and document pairs from databases
– Dataset is freely downloadable

Average number of slides in a presentation:    37.6
Average number of paragraphs in a document:   277.3
Average number of nil (zero) alignments:        6.6 (17.4%)
Average number of span alignments (s, x-y):     8.8 (23.4%)
Average number of point alignments (s, x):     22.2 (59.2%)
Total:                                         37.6 (100%)
2. Alignment – Evaluation
Alignment Method                                              Weighted Jaccard Accuracy
1. Max Similarity (cosine)                                    33.4%
2. Edit Distance (cosine)                                     28.8%
3. Local Jump (cosine)                                        25.1%
4. Jing HMM                                                   28.8%
5. Max Sim + spanning (Jaccard bigram)                        39.9%
6. Max Sim + spanning + nil classification (Jaccard bigram)   41.2%

40%? Why is it so difficult?
• Noise in the conversion process. Other studies have used clean data.
• Others have used soft accuracy (any overlap is correct).

Use Weighted Jaccard accuracy as the metric (sketch below):
• Fractional accuracy for partially correct answers
• Gives false positives (extra spurious alignments) less weight
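One possible reading of the metric, purely for illustration: per-slide overlap between predicted and gold paragraph sets, with spurious (false-positive) paragraphs down-weighted. The fp_weight value and the exact formula are assumptions, not the definition used in the paper.

def weighted_jaccard_accuracy(pred_spans, gold_spans, fp_weight=0.5):
    """pred_spans / gold_spans: per-slide sets of aligned paragraph indices
    (an empty set means a nil alignment)."""
    total = 0.0
    for pred, gold in zip(pred_spans, gold_spans):
        if not pred and not gold:
            total += 1.0                       # correct nil alignment
            continue
        hits = len(pred & gold)
        misses = len(gold - pred)
        false_pos = len(pred - gold)
        denom = hits + misses + fp_weight * false_pos
        total += hits / denom if denom else 0.0
    return total / len(gold_spans)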
3. User Interface – Rationale
How might fine-grained aligned pairs be utilized in a large DL?
Coordinated Views
• Learning / Comprehension
Collection Interface
• Comparing pairs
• Summarization
• Searching for suitable materials
• Offline Viewing
3. UI – Coordinated Views
[Diagram: coordinated views arranged from slide-centric to document-centric: Gallery View, Slideshow View, Slide View, Print View, Document View, Full Document View]
SlideSeer Prototype Demo
Production environment differs from demo
3. UI – Collection Interface
• Searching
– Lucene indexing of the static print view
– Show title along with the set of results
• Spider-friendly
– Main content loaded dynamically by Javascript, not spiderable
– Currently use the print view (as it is static) for the spiderable interface
• URLs (see the sketch below)
– Most material in the form <subject/surname/year/title/view/type?offset>
– Implies a hierarchy of papers
– Constructed URLs to promote browsing access
• Simple keyboard shortcuts
– For expert user navigation
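A small sketch of URL construction following the pattern on the slide; the slugging, parameter names, and defaults are illustrative rather than SlideSeer’s actual routing.

from urllib.parse import quote

def record_url(subject, surname, year, title, view="print", rtype="html", offset=0):
    """Build a browsable URL of the form /subject/surname/year/title/view/type?offset."""
    slug = quote(title.lower().replace(" ", "-"))
    path = f"/{subject}/{surname}/{year}/{slug}/{view}/{rtype}"
    return f"{path}?offset={offset}" if offset else path

# e.g. record_url("nlp", "kan", 2007, "SlideSeer a digital library")
#      -> "/nlp/kan/2007/slideseer-a-digital-library/print/html"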
Conclusion
• Alignment of documents to presentations
• Simple approach works well thus far
– Tweaks to get more mileage out of simple approach
– Span alignment, nil alignment modifications
– But certainly more models to try!
– 40% best performance, certainly much room to improve
Deployment status
– In Alpha (development)
– Beta hopefully in mid 2008
– Usability testing underway
Interested in digital anthologies?
• Join our mailing list (web: dAnth)
• Current: text extraction project for the ACL Anthology
Other slides
Future Work
• Planning to hook up current work in progress:
– 2-stage CRF/SVM re-ranking citation segmentation algorithm
– Automatic keyphrase extraction program
– Automatic synthetic image classification
– Automatic de-duplication module
• Partnering with Simone Teufel (Cambridge U.) to do argumentative zoning of documents
– What is a citation used for?
Poor slides
• Often represent a biased view of the full results
– Cherry-picking evidence to support claims
– Imply that evidence is independent (when it is statistically correlated)
– May summarize other findings inaccurately (secondary or tertiary sources)