Synchronicity
Real Time Recovery of
Missing Web Pages
Martin Klein
mklein@cs.odu.edu
Introduction to Digital Libraries
Week 14
CS 751 Spring 2011
04/12/2011
Who are you again?
• Ph.D. student w/ MLN since 2005
• Diagnostic exam in 2006, dissertation proposal
in 2008
• 17 publications to date
• Outstanding RA award CS dept
• CoS dissertation fellowship
• 3 ACM SIGWEB + 2 misc travel grants
• CS595 (S10) & CS518 (F10)
2
The Problem
http://www.jcdl2007.org
http://www.jcdl2007.org/JCDL2007_Program.pdf
3
The Problem
• Web users experience 404 errors
• expected lifetime of a web page is 44 days [Kahle97]
• 2% of web disappears every week [Fetterly03]
• Are they really gone? Or just relocated?
• has anybody crawled and indexed it?
• do Google, Yahoo!, Bing or the IA have a copy of
that page?
• Information retrieval techniques needed to
(re-)discover content
4
The Environment
Web Infrastructure (WI) [McCown07]
• Web search engines (Google, Yahoo!, Bing)
and their caches
• Web archives (Internet Archive)
• Research projects (CiteSeer)
5
Refreshing and Migration in the WI
Digital preservation happens in the WI
Google Scholar
CiteSeerX
Internet Archive
http://waybackmachine.org/*/http:/techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf
6
URI – Content Mapping Problem
Four cases describe how the mapping between a URI (U1) and its content (C1) at time A can change by a later time B:
1. The same URI maps to the same or very similar content at a later time (U1 → C1 at time B).
2. The same URI maps to different content at a later time (U1 → C2 at time B).
3. A different URI maps to the same or very similar content at the same or at a later time (U2 → C1).
4. The content cannot be found at any URI (U1 returns a 404).
7
Content Similarity
JCDL 2005
http://www.jcdl2005.org/
July 2005
http://www.jcdl2005.org/
Today
8
Content Similarity
Hypertext 2006
http://www.ht06.org/
August 2006
http://www.ht06.org/
Today
9
Content Similarity
PSP 2003
http://www.pspcentral.org/events/annual_meeting_2003.html
http://www.pspcentral.org/events/archive/annual_meeting_2003.html
August 2003
Today
10
Content Similarity
ECDL 1999
http://www-rocq.inria.fr/EuroDL99/
http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html
October 1999
Today
11
Content Similarity
Greynet 1999
http://www.konbib.nl/infolev/greynet/2.5.htm
1999
?
Today
?
12
Lexical Signatures (LSs)
• First introduced by Phelps and Wilensky [Phelps00]
• Small set of terms capturing “aboutness” of a
document, “lightweight” metadata
[Diagram: a resource and its abstract yield an LS; after removal of the resource, the LS is queried against Google, Yahoo!, and Bing (and proxy/cache copies) and the hit rate is measured]
13
Generation of Lexical Signatures
• Following the TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]
• Term frequency (TF):
  – “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  – “In how many documents does this word appear?”
14
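As a minimal sketch (not from the talk), the scheme reduces to a few lines; the tokenizer, the smoothing of unseen document frequencies, and the 5-term default are illustrative assumptions:

    import math
    import re
    from collections import Counter

    def lexical_signature(text, df, n_docs, k=5):
        """Return the k highest scoring TF-IDF terms of text as its LS."""
        terms = re.findall(r"[a-z0-9]+", text.lower())
        tf = Counter(terms)
        def tfidf(term):
            # unseen terms get df = 1 to avoid division by zero (assumption)
            return tf[term] * math.log(n_docs / df.get(term, 1))
        return sorted(tf, key=tfidf, reverse=True)[:k]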
LS as Proposed by Phelps and Wilensky
• “Robust Hyperlink”
• 5 terms are suitable
• Append LS to URL
http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=
texttiling+wilensky+disambiguation+subtopic+iago
• Limitations:
1. Applications (browsers) need to be modified to
exploit LSs
2. LSs need to be computed a priori
3. Works well with most URLs but not with all of
them
15
Generation of Lexical Signatures
• Park et al. [Park03] investigated the performance of various LS generation algorithms
• Evaluated “tunability” of the TF and IDF components
  • Weight on TF increases recall (completeness)
  • Weight on IDF improves precision (exactness)
16
Lexical Signatures -- Examples
URL: http://endeavour.cs.berkeley.edu/
LS: endeavour 94720-1776 achieve inter-endeavour amplifies
Rank/Results: 1/243

URL: http://www.jcdl2005.org
LS: jcdl2005 libraries conference cyberinfrastructure jcdl
Rank/Results: 1/1,930

URL: http://www.loc.gov
LS: celebrate knowledge webcasts kluge library
Rank/Results: 1/25,900
17
Synchronicity
404 error occurs while browsing:
  look for same or older page in WI              (1)
  if user satisfied:
    return page                                  (2)
  else:
    generate LS from retrieved page              (3)
    query SEs with LS
    if result sufficient:
      return “good enough” alternative page      (4)
    else:
      get more input about desired content       (5)
      (link neighborhood, user input, ...)
      re-generate LS && query SEs ...
      return pages                               (6)
Note: the system may not return any results at all.
18
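A sketch of this loop in Python; the helpers find_in_wi, user_satisfied, make_ls, expand_context, and search are hypothetical placeholders (none of these names come from the talk):

    def recover(missing_uri):
        copy = find_in_wi(missing_uri)             # (1) look in caches/archives
        if copy is not None:
            if user_satisfied(copy):
                return copy                        # (2) old copy is good enough
            results = search(make_ls(copy))        # (3) LS from retrieved page
            if results:
                return results[0]                  # (4) alternative page
        ls = make_ls(expand_context(missing_uri))  # (5) neighborhood, user input
        return search(ls) or None                  # (6) may return nothing at all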
Synchro…What?
Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal an underlying pattern, a framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961): “meaningful coincidence”
• Deschamps – de Fontgibu plum pudding example
19
picture from http://www.crystalinks.com/jung.html
404 Errors
20
404 Errors
21
“Soft 404” Errors
22
“Soft 404” Errors
23
A Comparison of Techniques for
Estimating IDF Values to Generate
Lexical Signatures for the Web
(WIDM 2008)
The Problem
• LSs are usually generated following the TF-IDF
scheme
• TF rather trivial to compute
• IDF requires knowledge about:
• overall size of the corpus (# of documents)
• # of documents a term occurs in
• Not complicated to compute for bounded
corpora (such as TREC)
• If the web is the corpus, values can only be
estimated
25
The Idea
• Use IDF values obtained from
  1. a local collection of web pages
  2. “screen scraping” SE result pages
• Validate both methods through comparison to a baseline
• Use Google N-Grams as the baseline
  • Note: N-Grams provide term count (TC) and not DF values – details to come
26
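A hedged sketch of the screen scraping idea: treat the result count a search engine reports for a term as its document frequency. The result_count helper and the assumed index size are placeholders, not a real API:

    import math

    WEB_SIZE = 20e9  # assumed number of indexed pages, not a published figure

    def estimated_idf(term, result_count):
        """IDF from a scraped 'about N results' hit count (assumption)."""
        df = max(result_count(term), 1)
        return math.log(WEB_SIZE / df)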
Accurate IDF Values for LSs
Screen scraping the Google web interface
27
The Dataset
Local universe consisting of copies of URLs from the IA
between 1996 and 2007
28
The Dataset
Same as above, follows Zipf distribution
10,493 observations
254,384 total terms
16,791 unique terms
29
The Dataset
Total terms vs new terms
30
LSs Example
Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms
31
Comparing LSs
1. Normalized term overlap
   • Assume term commutativity
   • k-term LSs normalized by k
2. Kendall Tau
   • Modified version, since the LSs to compare may contain different terms
3. M-Score
   • Penalizes discordance in higher ranks
32
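Measure (1) is straightforward; a sketch, assuming both LSs are plain term lists:

    def normalized_overlap(ls_a, ls_b):
        """Order-insensitive overlap of two k-term LSs, normalized by k."""
        k = max(len(ls_a), len(ls_b))
        return len(set(ls_a) & set(ls_b)) / k

    # normalized_overlap(["jcdl2005", "libraries", "conference"],
    #                    ["jcdl2005", "conference", "archives"]) == 2/3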
Comparing LSs
Top 5, 10 and 15
terms
LC – local universe
SC – screen scraping
NG – N-Grams
33
Conclusions
• Both methods for the computation of IDF
values provide accurate results
• compared to the Google N-Gram baseline
• Screen scraping method seems preferable since
  • similarity scores are slightly higher
  • it is feasible in real time
34
Correlation of Term Count and Document
Frequency for Google N-Grams
(ECIR 2009)
The Problem
• Need a reliable source to accurately compute IDF values of web pages (in real time)
• Shown that screen scraping works, but the baseline (Google N-Grams) lacks validation
• N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF
  → What is their relationship?
36
Background & Motivation
• Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept
• Used (among others) to generate lexical signatures (LSs)
• TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus
  → When the entire web is the corpus, IDF can only be estimated!
• Most text corpora provide term count (TC) values
D1 = “Please, Please Me”        D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”     D4 = “Long, Long, Long”

Term     TC   DF
All       1    1
Buy       1    1
Can’t     1    1
Is        1    1
Love      2    2
Me        2    2
Need      1    1
Please    2    1
You       1    1
Long      3    1

TC >= DF, but is there a correlation? Can we use TC to estimate DF?
37
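The distinction can be reproduced on the four song-title “documents” above (the tokenizer is an illustrative assumption):

    import re
    from collections import Counter

    docs = ["Please, Please Me", "Can't Buy Me Love",
            "All You Need Is Love", "Long, Long, Long"]
    tokenized = [re.findall(r"[a-z']+", d.lower()) for d in docs]

    tc = Counter(t for doc in tokenized for t in doc)        # every occurrence
    df = Counter(t for doc in tokenized for t in set(doc))   # one per document

    assert tc["long"] == 3 and df["long"] == 1   # TC >= DF always holds
    assert tc["love"] == 2 and df["love"] == 2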
The Idea
• Investigate the relationship between:
  • TC and DF within the Web as Corpus (WaC)
  • WaC based TC and Google N-Gram based TC
• TREC and BNC could be used, but:
  • they are not free
  • TREC has been shown to be somewhat dated [Chiang05]
38
The Experiment
• Analyze the correlation between the term list ordered by TC rank and the list ordered by DF rank by computing:
• Spearman‘s Rho
• Kendall Tau
• Display frequency of TC/DF ratio for all terms
• Compare TC (WaC) and TC (N-Grams)
frequencies
39
Experiment Results
Investigate correlation between TC and DF
within “Web as Corpus” (WaC)
Rank similarity of all terms
40
Experiment Results
Investigate correlation between TC and DF
within “Web as Corpus” (WaC)
Spearman’s ρ and Kendall τ
41
Experiment Results
Top 10 terms in decreasing order of their TF/IDF values, taken from http://ecir09.irit.fr

Rank  WaC-DF      WaC-TC      Google      N-Grams
 1    IR          IR          IR          IR
 2    RETRIEVAL   RETRIEVAL   RETRIEVAL   IRSG
 3    IRSG        IRSG        IRSG        RETRIEVAL
 4    BCS         IRIT        CONFERENCE  BCS
 5    IRIT        BCS         BCS         EUROPEAN
 6    CONFERENCE  2009        GRANT       CONFERENCE
 7    GOOGLE      FILTERING   IRIT        IRIT
 8    2009        GOOGLE      FILTERING   GOOGLE
 9    FILTERING   CONFERENCE  EUROPEAN    ACM
10    GRANT       ARIA        PAPERS      GRANT

U = 14, ∩ = 6 → a strong indicator that TC can be used to estimate DF for web pages!
(Google: screen scraping DF values from the Google web interface)
42
Experiment Results
Frequency of the TC/DF ratio within the WaC
(panels: two decimals, one decimal, integer values)
43
Experiment Results
Show similarity between WaC based TC and Google N-Gram based TC
TC frequencies (note: the N-Grams have a frequency threshold of 200)
44
Conclusions
• TC and DF Ranks within the WaC show strong
correlation
• TC frequencies of the WaC and the Google N-Grams are very similar
• Together with the results shown earlier (high correlation between the baseline and the two other methods), N-Grams seem suitable for accurate IDF estimation for web pages
  → This does not mean everything correlated to TC can be used as a DF substitute!
45
Inter-Search Engine
Lexical Signature Performance
(JCDL 2009)
Inter-Search Engine
Lexical Signature Performance
Martin Klein
Michael L. Nelson
{mklein,mln}@cs.odu.edu
Example: different 5-term lexical signatures generated for
http://en.wikipedia.org/wiki/Elephant
• Elephant, Tusks, Trunk, African, Loxodonta
• Elephant, African, Tusks, Asian, Trunk
• Elephant, Asian, African, Species, Trunk
48
Revisiting Lexical Signatures to
(Re-)Discover Web Pages
(ECDL 2008)
How to Evaluate the Evolution of LSs over Time
Idea:
• Conduct overlap analysis of LSs generated
over time
• LSs based on local universe mentioned above
• Neither Phelps and Wilensky nor Park et al. did that
  • Park et al. just re-confirmed their findings after 6 months
50
Dataset
Local universe consisting of copies of URLs from the IA
between 1996 and 2007
51
LSs Over Time - Example
10-term LSs generated for
http://www.perfect10wines.com
52
LS Overlap Analysis
Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URL has been observed.

Sliding: overlap between two LSs of consecutive years, starting with the first year and ending with the last.
53
Evolution of LSs over Time
Rooted
Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone they do not return
54
Evolution of LSs over Time
Sliding
Results:
• Overlap increases over time
• Seems to reach a steady state around 2003
55
Performance of LSs
Idea:
• Query Google search API with LSs
• LSs based on local universe mentioned above
• Identify URL in result set
• For each URL it is possible that:
  1. the URL is returned as the top ranked result
  2. the URL is ranked somewhere between 2 and 10
  3. the URL is ranked somewhere between 11 and 100
  4. the URL is ranked somewhere beyond rank 100 → considered as not returned
56
Performance of LSs wrt Number of Terms
Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
• Top mean rank (MR) value with 5 terms
• Most top ranked with 7 terms
• Binary pattern: either in top 10 or undiscovered
• 8 terms and beyond do not show improvement
57
Performance of LSs wrt Number of Terms

Rank distribution of 5-term LSs
• Lightest gray = rank 1
• Black = rank 101 and beyond
• Ranks 11-20, 21-30, … colored proportionally
• 50% top ranked, 20% in top 10, 30% black
58
Performance of LSs
Scoring:
• normalized Discounted Cumulative Gain (nDCG)
• Binary relevance: 1 for match, 0 otherwise
59
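A sketch of this scoring with the standard log2 discount (an assumption; the paper's exact discount may differ). With binary relevance and a single relevant document, the ideal DCG is 1:

    import math

    def ndcg(result_uris, target_uri):
        """nDCG with gain 1 at the rank where the target URI appears."""
        dcg = sum(1.0 / math.log2(rank + 1)
                  for rank, uri in enumerate(result_uris, start=1)
                  if uri == target_uri)
        return dcg / 1.0  # ideal: the match at rank 1, 1/log2(2) = 1

    # ndcg(["a", "b", "target"], "target") == 1 / log2(4) == 0.5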
Performance of LSs wrt Number of Terms
nDCG for LSs consisting of 2-15 terms
(mean over all years)
60
Performance of LSs over Time
Score for LSs consisting of 2, 5, 7 and 10 terms
61
Conclusions
• LSs decay over time
• Rooted: quickly after generation
• Sliding: seem to stabilize
• 5-, 6- and 7-term LSs seem to perform best
• 7 – most top ranked
• 5 – fewest undiscovered
• 5 – lowest mean rank
• 2..4 as well as 8+ terms insufficient
62
Evaluating Methods to Rediscover
Missing Web Pages from the
Web Infrastructure
(JCDL 2010)
The Problem

Internet Archive Wayback Machine:
http://web.archive.org/web/*/http://www.aircharterinternational.com
(59 copies of www.aircharter-international.com)

Lexical Signature (TF/IDF):
Charter Aircraft Cargo Passenger Jet Air Enquiry

Title:
ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

64
The Problem

www.aircharter-international.com
Lexical Signature (TF/IDF):
Charter Aircraft Cargo Passenger Jet Air Enquiry

65
The Problem
www.aircharter-international.com
Title
ACMI, Private Jet
Charter, Private Jet
Lease, Charter Flight
Service: Air Charter
International
66
The Problem

If no archived/cached copy can be found...
• Link Neighborhood (LNLS)
• Tags

[Diagram: a missing page “?” linked from neighboring pages A, B, and C]

67
The Problem
68
Contributions
• Compare the performance of four automated methods to rediscover web pages:
  1. Lexical signatures (LSs)
  2. Titles
  3. Tags
  4. LNLS
• Analysis of title characteristics wrt their retrieval performance
• Evaluate the performance of combinations of methods and suggest a workflow for real time web page rediscovery
69
Experiment - Data Gathering
• 500 URIs randomly sampled from DMOZ
• Applied filters
– .com, .org, .net, .edu domains
– English Language
– min. of 50 terms [Park]
• Results in 309 URIs to download and parse
70
Experiment - Data Gathering
• Extract title
  – <Title>...</Title>
• Generate 3 LSs per page
  – IDF values obtained from Google, Yahoo!, MSN Live
• Obtain tags from the delicious.com API (available for only 15%)
• Obtain link neighborhood from the Yahoo! API (max. 50 URIs)
  – Generate LNLS
  – TF from “bucket” of words per neighborhood
  – IDF obtained from Yahoo! API
71
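A minimal sketch of the title extraction step, using only the standard library (real pages need more robust HTML handling; the regex is an assumption):

    import re
    import urllib.request

    def page_title(uri):
        html = urllib.request.urlopen(uri).read().decode("utf-8", "replace")
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        return match.group(1).strip() if match else None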
LS Retrieval Performance
5- and 7-Term LSs
• Yahoo! returns the most URIs top ranked and leaves the least undiscovered
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
72
Title Retrieval Performance
Non-Quoted and Quoted Titles
• Results at least as good as for LSs
• Google and Yahoo! return more URIs for non-quoted titles
• Same binary retrieval pattern
73
Tags Retrieval Performance
• The API returns up to the top 10 tags; we distinguish between the # of tags queried
• Low # of URIs
• More later…
74
LNLS Retrieval Performance
• 5- and 7-term LNLSs
• < 5% top ranked
• More later…
75
Combination of Methods
Can we achieve better retrieval performance if we combine 2 or more methods?

Query LS → Done; else Query Title → Done; else Query Tags → Done; else Query LNLS

76
Combination of Methods
Single methods, % of URIs:

Google      Top    Top10  Undis
LS5         50.8   12.6   32.4
LS7         57.3    9.1   31.1
TI          69.3    8.1   19.7
TA           2.1   10.6   75.5

MSN Live    Top    Top10  Undis
LS5         63.1    8.1   27.2
LS7         62.8    5.8   29.8
TI          61.5    6.8   30.7
TA           0.0    8.5   80.9

Yahoo!      Top    Top10  Undis
LS5         67.6    7.8   22.3
LS7         66.7    4.5   26.9
TI          63.8    8.1   27.5
TA           6.4   17.0   63.8

77
Combination of Methods
Top Results for Combination of Methods (% of URIs top ranked)

              Google  Yahoo!  MSN Live
LS5-TI         65.0    73.8    71.5
LS7-TI         70.9    75.7    73.8
TI-LS5         73.5    75.7    73.1
TI-LS7         74.1    75.1    74.1
LS5-TI-LS7     65.4    73.8    72.5
LS7-TI-LS5     71.2    76.4    74.4
TI-LS5-LS7     73.8    75.7    74.1
TI-LS7-LS5     74.4    75.7    74.8
LS5-LS7        52.8    68.0    64.4
LS7-LS5        59.9    71.5    66.7

78
Title Characteristics
Length in # of Terms
• Length varies between 1 and 43 terms
• Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]
79
Title Characteristics
Length in # of Characters
• Length varies between 4 and 294 characters
• Short titles (<10 characters) do not perform well
• Length between 10 and 70 characters is most common
• Length between 10 and 45 characters seems to perform best
80
Title Characteristics
Mean # of Characters, # of Stop Words
• Title terms with a mean of 5, 6, or 7 characters seem most suitable for well performing titles
• More than 1 or 2 stop words hurts performance
81
Conclusions
Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% of URIs top ranked.
Tags and link neighborhood LSs do not seem to contribute significantly to the retrieval of the web pages.
Titles are much cheaper to obtain than LSs.
The combination of primarily querying titles, with 5-term LSs as a second option, returns more than 75% of URIs top ranked.
Not all titles are equally good: titles containing between 3 and 6 terms seem to perform best, and more than a couple of stop words hurt performance.
82
Is This a Good Title?
(Hypertext 2010)
The Problem

www.aircharter-international.com
Lexical Signature (TF/IDF):
Charter Aircraft Cargo Passenger Jet Air Enquiry

86
The Problem
www.aircharter-international.com
Title
ACMI, Private Jet
Charter, Private Jet
Lease, Charter Flight
Service: Air Charter
International
87
The Problem
http://www.drbartell.com/
Lexical Signature
(TF/IDF)
Plastic Surgeon
Reconstructive Dr
Bartell Symbol
University
88
???
The Problem
http://www.drbartell.com/
Title
Thomas Bartell
MD Board-Certified Cosmetic Plastic
Reconstructive
Surgery
89
The Problem
www.reagan.navy.mil
Lexical Signature
(TF/IDF)
Ronald USS MCSN
Torrey Naval Sea
Commanding
90
The Problem
www.reagan.navy.mil
Title
Home Page
Is This a
Good Title?
91
???
Contributions
• Discuss the discovery performance of web page titles (compared to LSs)
• Analysis of discovered pages regarding their relevancy
• Display title evolution compared to content evolution over time
• Provide a prediction model for a title’s retrieval potential
92
Experiment - Data Gathering
• 20k URIs randomly sampled from DMOZ
• Applied filters
  – English language
  – min. of 50 terms
• Results in 6,875 URIs
• Downloaded and parsed the pages
• Extract title and generate LS per page (baseline)

        Original  Filtered
.com     15289     4863
.org      2755     1327
.net      1459      369
.edu       497      316
sum      20000     6875

93
Title (and LS) Retrieval Performance
Titles vs. 5- and 7-Term LSs
• Titles return more than 60% of URIs top ranked
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
94
Relevancy of Retrieval Results
Do titles return relevant results besides the original URI?
• Distinguish between discovered (top 10) and undiscovered URIs
• Analyze the content of the top 10 results
• Measure relevancy in terms of normalized term overlap and shingles between the original URI and each search result, by rank
95
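Sketches of the two measures, assuming documents are given as term lists (the shingle size w = 3 is an assumption):

    def term_overlap(a_terms, b_terms):
        a, b = set(a_terms), set(b_terms)
        return len(a & b) / max(len(a), len(b), 1)

    def shingle_similarity(a_terms, b_terms, w=3):
        def shingles(t):
            return {tuple(t[i:i + w]) for i in range(len(t) - w + 1)}
        sa, sb = shingles(a_terms), shingles(b_terms)
        return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0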
Relevancy of Retrieval Results
Term Overlap: Discovered vs. Undiscovered
High relevancy in the top ranks, with possible aliases and duplicates.
96
Relevancy of Retrieval Results
Shingles: Discovered vs. Undiscovered
Results with better shingle values than the top ranked URI are possible aliases and duplicates.
97
Title Evolution – Example I
www.sun.com/solutions

1998-01-27  Sun Software Products Selector Guides - Solutions Tree
1999-02-20  Sun Software Solutions
2002-02-01  Sun Microsystems Products
2002-06-01  Sun Microsystems - Business & Industry Solutions
2003-08-01  Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02  Sun Microsystems – Solutions
2004-06-10  Gateway Page - Sun Solutions
2006-01-09  Sun Microsystems Solutions & Services
2007-01-03  Services & Solutions
2007-02-07  Sun Services & Solutions
2008-01-19  Sun Solutions

98
Title Evolution – Example II
www.datacity.com/mainf.html

2000-06-19  DataCity of Manassas Park Main Page
2000-10-12  DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21  DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16  computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14  Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB

99
Title Evolution Over Time
How much do titles change over time?
• Copies from fixed size time windows per year
• Extract available titles of the past 14 years
• Compute the normalized Levenshtein edit distance between the titles of the copies and the baseline (0 = identical; 1 = completely dissimilar)
100
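A sketch of the normalized Levenshtein edit distance (normalizing by the longer title's length is an assumption):

    def normalized_levenshtein(a, b):
        """0 = identical, 1 = completely dissimilar."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1] / max(len(a), len(b), 1)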
Title Evolution Over Time
Title edit distance frequencies
• Half the titles of available copies from recent years are (close to) identical
• Decay from 2005 on (with fewer copies available)
• A 4 year old title has a 40% chance to be unchanged
101
Title Evolution Over Time
Title vs Document
• Y axis: avg shingle value for all copies per URI
• X axis: avg edit distance of the corresponding titles
• Overlap indicated by color: green <10, red >90; semi-transparency shows the total amount of points plotted
• The point [0,1] occurs over 1600 times; [0,0] occurs 122 times
102
Title Performance Prediction
• Quality prediction of a title by
  • the number of nouns, articles, etc.
  • the amount of title terms and characters ([Ntoulas])
• Observation of re-occurring terms in poorly performing titles - “Stop Titles”:
  home, index, home page, welcome, untitled document

The performance of any given title can be predicted as insufficient if it consists to 75% or more of a “Stop Title”!

[Ntoulas] A. Ntoulas et al., “Detecting Spam Web Pages Through Content Analysis”, In Proceedings of WWW 2006, pp. 83-92
103
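A sketch of this rule; matching individual title terms against the Stop Title vocabulary is an interpretation of the 75% criterion:

    STOP_TITLES = {"home", "index", "home page", "welcome", "untitled document"}
    STOP_TERMS = {term for st in STOP_TITLES for term in st.split()}

    def predicted_insufficient(title, threshold=0.75):
        terms = title.lower().split()
        if not terms:
            return True
        stop = sum(t in STOP_TERMS for t in terms)
        return stop / len(terms) >= threshold

    # predicted_insufficient("Home Page")              -> True
    # predicted_insufficient("Sun Software Solutions") -> False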
Conclusions
The “aboutness” of web pages can be determined from either the content or the title.
More than 60% of URIs are returned top ranked when using the title as a search engine query.
Titles change more slowly and less significantly over time than the web pages’ content.
Not all titles are equally good: if the majority of a title’s terms form a Stop Title, its quality can be predicted to be poor.
104
Find, New, Copy, Web, Page: Tagging for the (Re-)Discovery of Web Pages
(submitted for publication)
The Problem
We have seen that we have a good chance
to rediscover missing pages with
• Lexical signatures
• Titles
BUT
What if no archived/cached copy can be
found?
106
The Solution?
Use the page’s tags as a search query, e.g.:
Conferences, Digitallibraries, Conference, Library, Jcdl2005
107
The Questions
• What is a good length for a tag based query string?
  • 5 or 7 tags, like lexical signatures?
• Can we improve retrieval performance when combining tags w/ title- and/or lexical signature-based queries?
• Do tags contain information about a page that is not in the title/content?
108
The Experiment
• URIs with tags are rather sparse in previously created corpora
• Creation of a new, tag centered corpus
  • query Delicious for 5k unique URIs
  • eventually obtain:
    • 4,968 URIs
    • 11 duplicates
    • 21 URIs w/o tags
109
The Experiment
• Tags queried against the Yahoo! BOSS API
• Same four retrieval cases introduced earlier
• nDCG w/ same relevance scoring
• Mean Average Precision
110
The Experiment
• Jaro-Winkler distance between URIs
• Dice similarity between contents
111
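Jaro-Winkler is a standard string similarity; the Dice similarity over contents can be sketched as follows (set-based tokenization is an assumption):

    def dice(a_terms, b_terms):
        a, b = set(a_terms), set(b_terms)
        if not a and not b:
            return 1.0
        return 2 * len(a & b) / (len(a) + len(b))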
The Experiment
Combining methods
112
The Experiment
• Fact: ~50% of tags do not occur in the page
• “Secret”: ~50% of tags do not occur in the current version of the page
• Ergo: how about previous versions?
113
Ghost Tags
• 3,306 URIs w/ older copies
• 66.3% of our tags do not occur in the page
• 4.9% of tags occur in a previous version of the page – Ghost Tags
  • they represent a previous version better than the current one
• But what kind of tags are these? Are they important to the document? To the Delicious user?
114
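Ghost tag detection reduces to set logic, given term sets for the current page and an older copy (the representation is an assumption):

    def ghost_tags(tags, current_terms, previous_terms):
        tags = {t.lower() for t in tags}
        missing = tags - set(current_terms)    # the ~66% absent from the page
        return missing & set(previous_terms)   # present in an older copy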
Ghost Tags
• Document importance: TF rank
• User importance: Delicious rank
• Normalized rank: 0 = top, 1 = bottom
115
Conclusions
Tags can be used for search!
We can improve retrieval performance by combining tag based search with titles and lexical signatures.
Ghost Tags exist! One out of three important terms describes a previous version of a page better than the current one.
Open questions: How old are Ghost Tags? When do tags “ghostify”, wrt the importance/change of the page?
116
Rediscovering Missing Web Pages Using Link
Neighborhood Lexical Signatures
(JCDL 2011)
The Problem
We have seen that we have a good chance
to rediscover missing pages with
• Lexical signatures
• Titles
BUT
What if no archived/cached copy can be
found?
Plan A: Tags
118
The Solution?
Plan B: Link neighborhood Lexical Signatures
119
The Questions
• What is a good length for a neighborhood based lexical signature?
  • 5 or 7 terms, like lexical signatures?
  • 5..8 terms, like tag-based queries?
• How many backlinks do we need?
  • Is the 1st level of backlinks sufficient?
• From where in the linking page should we draw the candidate terms?
120
The Radius Question
• Entire page
• Paragraph
• Anchor text
121
The Dataset
• Same as for JCDL 2010 experiment
• 309 URIs
• 28,325 first level & 306,700 second level backlinks
• Filter for language, file type, content length, HTTP
response code, “soft 404s” => 12% discarded
• Lexical signature generation
• IDF values from Yahoo!
• 1..7 and 10 terms
122
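A sketch of how the term “bucket” for an LNLS might be collected under the anchor-text-radius model; the backlink representation (term list plus anchor span) is an assumption:

    def neighborhood_bucket(backlinks, radius=10):
        """Anchor text plus up to `radius` words on either side, per backlink."""
        bucket = []
        for page_terms, (start, end) in backlinks:
            lo = max(0, start - radius)
            hi = min(len(page_terms), end + radius)
            bucket.extend(page_terms[lo:hi])
        return bucket  # rank by TF-IDF as before to get the LNLS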
The Results
Anchor text only (notation: level-radius-rank)
123
The Results – Backlink Level
Anchor text ± 5 words (level-radius-rank)
124
The Results – Backlink Level
Anchor text ± 10 words (level-radius-rank)
125
The Results – Backlink Level
Anchor text ± 10 words (level-radius-rank)
126
The Results – Radius
All radii (level-radius-rank)
127
The Results – Backlink Rank
Anchor text, backlink ranks 10, 100, 1000 (level-radius-rank)
128
The Results – In Numbers
Winner: 1-anchor-10 (over 1-anchor-1000)
• 4 terms
• first backlink level only
• top 10 backlinks only
• anchor text only
129
Conclusions
Link neighborhood based lexical signatures can help rediscover missing pages.
They are a feasible “Plan C”, given the high success rate of the cheaper methods (titles, tags, lexical signatures).
Fortunately, the smallest parameters perform best (anchor text only, top 10 backlinks, 1st level backlinks).
Can we find an optimum for the number of backlinks? (10/100/1000 leaves a big margin)
Can we identify “Stop Anchors”, e.g. click here, acrobat, etc.?
130