Synchronicity
Time on the Web - Week 3
CS 895 Fall 2010
Martin Klein
mklein@cs.odu.edu
09/15/2010
The Problem
http://www.jcdl2007.org
http://www.jcdl2007.org/JCDL2007_Program.pdf
The Problem
• Web users experience 404 errors
  – the expected lifetime of a web page is 44 days [Kahle97]
  – 2% of the web disappears every week [Fetterly03]
• Are they really gone? Or just relocated?
  – has anybody crawled and indexed the page?
  – do Google, Yahoo!, Bing or the IA have a copy of it?
• Information retrieval techniques are needed to (re-)discover content
The Environment
Web Infrastructure (WI) [McCown07]
• Web search engines (Google, Yahoo!, Bing) and their caches
• Web archives (Internet Archive)
• Research projects (CiteSeer)
Refreshing and Migration in the WI
Digital preservation happens in the WI:
• Google Scholar
• CiteSeerX
• Internet Archive
URI – Content Mapping Problem
[Figure: four URI-to-content mapping cases over time]
1. Same URI maps to same or very similar content at a later time
   (U1 → C1 at time A, U1 → C1 at time B)
2. Same URI maps to different content at a later time
   (U1 → C1 at time A, U1 → C2 at time B)
3. Different URI maps to same or very similar content at the same or at a later time
   (U1 → C1 at time A; later U1 returns 404 while U2 → C1)
4. The content cannot be found at any URI
   (U1 → C1 at time A, U1 → ??? at time B)
Content Similarity
JCDL 2005 – http://www.jcdl2005.org/
[Screenshots: the page in July 2005 vs. today]
Content Similarity
Hypertext 2006 – http://www.ht06.org/
[Screenshots: the page in August 2006 vs. today]
Content Similarity
PSP 2003
August 2003: http://www.pspcentral.org/events/annual_meeting_2003.html
Today: http://www.pspcentral.org/events/archive/annual_meeting_2003.html
Content Similarity
ECDL 1999
October 1999: http://www-rocq.inria.fr/EuroDL99/
Today: http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html
Content Similarity
Greynet 1999
http://www.konbib.nl/infolev/greynet/2.5.htm
1999: ?   Today: ?   (the page cannot be found at any URI)
Lexical Signatures (LSs)
• First introduced by Phelps and Wilensky [Phelps00]
• Small set of terms capturing the “aboutness” of a document; “lightweight” metadata
[Figure: system sketch – resource, abstract, LS; removal and hit rate; Google, Yahoo; proxy, cache]
Generation of Lexical Signatures
• Following the TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]
• Term frequency (TF):
  – “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  – “In how many documents does this word appear?”
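As a concrete illustration, here is a minimal sketch of TF-IDF based LS generation; the tokenizer, the log-smoothed IDF variant, and the function name are assumptions for this example, not the exact formulation of the papers cited here.

import math
import re
from collections import Counter

def lexical_signature(text, df, n_docs, k=5):
    # df: term -> number of corpus documents containing the term
    # n_docs: total number of documents in the corpus
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)
    def score(t):
        # TF times a log-smoothed IDF (one of several common variants)
        return tf[t] * math.log(n_docs / (1 + df.get(t, 0)))
    return sorted(tf, key=score, reverse=True)[:k]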
LS as Proposed by Phelps and Wilensky
• “Robust Hyperlink”
• 5 terms are suitable
• Append the LS to the URL:
http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
• Limitations:
  1. Applications (browsers) need to be modified to exploit LSs
  2. LSs need to be computed a priori
  3. Works well with most URLs but not with all of them
Generation of Lexical Signatures
• Park et al. [Park03] investigated the performance of various LS generation algorithms
• Evaluated the “tunability” of the TF and IDF components:
  – weight on TF increases recall (completeness)
  – weight on IDF improves precision (exactness)
Lexical Signatures -- Examples
Rank/Results: 1/1
URL: http://www.cs.berkeley.edu/~wilensky/NLP.html
LS: texttiling wilensky disambiguation subtopic iago
Query: http://www.google.com/search?q=texttiling+wilensky+disambiguation+subtopic+iago

Rank/Results: na/10
URL: http://www.dli2.nsf.gov
LS: nsdl multiagency imls testbeds extramural
Query: http://www.google.com/search?q=nsdl+multiagency+imls+testbeds+extramural

Rank/Results: 1/221,000 (1/174,000 in 01/2008)
URL: http://www.loc.gov
LS: library collections congress thomas american
Query: http://www.google.com/search?q=library+collections+congress+thomas+american

Rank/Results: 1/51 (2/77 in 01/2008)
URL: http://www.jcdl2008.org
LS: libraries jcdl digital conference pst
Query: http://www.google.com/search?q=libraries+jcdl+digital+conference+pst
Synchronicity
404 error occurs while browsing:
look for same or older page in WI                  (1)
if user satisfied:
    return page                                    (2)
else:
    generate LS from retrieved page                (3)
    query SEs with LS
    if result sufficient:
        return “good enough” alternative page      (4)
    else:
        get more input about desired content       (5)
        (link neighborhood, user input, ...)
        re-generate LS && query SEs ...
        return pages                               (6)
The system may not return any results at all.
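The same workflow as a hedged Python sketch; wi, engines, and user are hypothetical stand-ins for the Web Infrastructure, the search engine APIs, and the browsing user, and make_ls is any LS generator such as the TF-IDF sketch earlier.

def synchronicity(url, wi, engines, user, make_ls):
    copy = wi.lookup(url)                              # (1) search the WI: caches, archives
    if copy is not None and user.satisfied(copy):
        return copy                                    # (2) same or older page suffices
    base = copy.text if copy is not None else ""
    ls = make_ls(base)                                 # (3) LS from the retrieved copy
    hits = [h for se in engines for h in se.query(" ".join(ls))]
    if hits and user.satisfied(hits[0]):
        return hits[0]                                 # (4) "good enough" alternative
    extra = user.more_context(url)                     # (5) link neighborhood, user input, ...
    ls = make_ls(base + " " + extra)                   # re-generate the LS
    return [h for se in engines for h in se.query(" ".join(ls))]  # (6) may be empty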
Synchro…What?
Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal an underlying pattern, a framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
• “meaningful coincidence”
• Deschamps – de Fontgibu plum pudding example
picture from http://www.crystalinks.com/jung.html
404 Errors
[Screenshots of 404 error pages]
“Soft 404” Errors
[Screenshots of “soft 404” pages]
A Comparison of Techniques for
Estimating IDF Values to Generate
Lexical Signatures for the Web
(WIDM 2008)
The Problem
• LSs are usually generated following the TF-IDF scheme
• TF is rather trivial to compute
• IDF requires knowledge about:
  – the overall size of the corpus (# of documents)
  – the # of documents a term occurs in
• Also not complicated to compute for bounded corpora (such as TREC)
• If the web is the corpus, the values can only be estimated
The Idea
• Use IDF values obtained from
  1. a local collection of web pages
  2. “screen scraping” SE result pages
• Validate both methods through comparison to a baseline
• Use Google N-Grams as the baseline
• Note: N-Grams provide term count (TC) and not DF values – details to come
Accurate IDF Values for LSs
[Screenshot: screen scraping the Google web interface for result counts]
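The arithmetic behind this is simple once a result count is available; a sketch assuming that reported hit counts approximate document frequencies and that the engine's count for a very common word approximates the index size – both assumptions, not the paper's calibrated method.

import math

def estimated_idf(term_hits, web_size_estimate):
    # term_hits: the engine's "about N results" count for the term
    # web_size_estimate: e.g. the engine's count for a stop word such as "the"
    return math.log(web_size_estimate / max(1, term_hits))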
The Dataset
[Figure] Local universe consisting of copies of URLs from the IA between 1996 and 2007
The Dataset
Same as above; the term frequencies follow a Zipf distribution
• 10,493 observations
• 254,384 total terms
• 16,791 unique terms
The Dataset
[Figure: total terms vs. new terms]
LSs Example
Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms
Comparing LSs
1. Normalized term overlap
   – assumes term commutativity
   – k-term LSs normalized by k
2. Kendall Tau
   – modified version, since the LSs to compare may contain different terms
3. M-Score
   – penalizes discordance in higher ranks
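Normalized term overlap is straightforward to state; a minimal sketch (the modified Kendall Tau and the M-Score are defined in the paper and omitted here).

def normalized_overlap(ls_a, ls_b):
    # overlap of two k-term LSs, order ignored, normalized by k
    assert len(ls_a) == len(ls_b)
    return len(set(ls_a) & set(ls_b)) / len(ls_a)

# e.g. two top-5 LSs sharing three terms -> 0.6
print(normalized_overlap(["charter", "aircraft", "cargo", "jet", "air"],
                         ["charter", "aircraft", "cargo", "passenger", "enquiry"]))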
Comparing LSs
[Figure: similarity scores for the top 5, 10 and 15 terms]
LC – local universe, SC – screen scraping, NG – N-Grams
Conclusions
• Both methods for the computation of IDF values provide accurate results
  – compared to the Google N-Gram baseline
• The screen scraping method seems preferable since
  – similarity scores are slightly higher
  – it is feasible in real time
Correlation of Term Count and Document
Frequency for Google N-Grams
(ECIR 2009)
The Problem
• Need a reliable source to accurately compute IDF values of web pages (in real time)
• Screen scraping was shown to work, but
  – validation of the baseline (Google N-Grams) is missing
• N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF
→ what is their relationship?
Background & Motivation
• Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept
• Used (among others) to generate lexical signatures (LSs)
• TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus
→ When the entire web is the corpus, IDF can only be estimated!
• Most text corpora provide term count (TC) values

D1 = “Please, Please Me”        D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”     D4 = “Long, Long, Long”

Term     TC  DF
All       1   1
Buy       1   1
Can’t     1   1
Is        1   1
Love      2   2
Me        2   2
Need      1   1
Please    2   1
You       1   1
Long      3   1

TC >= DF, but is there a correlation? Can we use TC to estimate DF?
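A sketch that computes both quantities for the four song-title “documents” above; the tokenization is an assumption.

from collections import Counter

docs = ["Please, Please Me", "Can't Buy Me Love",
        "All You Need Is Love", "Long, Long, Long"]

def terms(doc):
    return [t.strip(",").lower() for t in doc.split()]

tc = Counter(t for d in docs for t in terms(d))       # count every occurrence
df = Counter(t for d in docs for t in set(terms(d)))  # count each document once

assert all(tc[t] >= df[t] for t in tc)  # TC >= DF always holds
print(tc["long"], df["long"])           # 3 1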
The Idea
• Investigate the relationship between:
  – TC and DF within the Web as Corpus (WaC)
  – WaC based TC and Google N-Gram based TC
• TREC, BNC could be used but:
  – they are not free
  – TREC has been shown to be somewhat dated [Chiang05]
The Experiment
• Analyze the correlation of a list of terms ordered by their TC and DF rank by computing:
  – Spearman’s Rho
  – Kendall Tau
• Display the frequency of the TC/DF ratio for all terms
• Compare TC (WaC) and TC (N-Grams) frequencies
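Both coefficients are available off the shelf, e.g. in SciPy; the rank lists below are illustrative only, not the WaC data.

from scipy.stats import spearmanr, kendalltau

tc_ranks = [1, 2, 3, 4, 5, 6]   # terms ranked by TC (illustrative)
df_ranks = [1, 3, 2, 4, 6, 5]   # the same terms ranked by DF (illustrative)

rho, _ = spearmanr(tc_ranks, df_ranks)
tau, _ = kendalltau(tc_ranks, df_ranks)
print(rho, tau)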
Experiment Results
Investigate correlation between TC and DF within the “Web as Corpus” (WaC)
[Figure: rank similarity of all terms]
Experiment Results
Investigate correlation between TC and DF within the “Web as Corpus” (WaC)
[Figure: Spearman’s ρ and Kendall τ]
Experiment Results
Top 10 terms in decreasing order of their TF/IDF values, taken from http://ecir09.irit.fr

Rank  WaC-DF      WaC-TC      Google      N-Grams
1     IR          IR          IR          IR
2     RETRIEVAL   RETRIEVAL   RETRIEVAL   IRSG
3     IRSG        IRSG        IRSG        RETRIEVAL
4     BCS         IRIT        CONFERENCE  BCS
5     IRIT        BCS         BCS         EUROPEAN
6     CONFERENCE  2009        GRANT       CONFERENCE
7     GOOGLE      FILTERING   IRIT        IRIT
8     2009        GOOGLE      FILTERING   GOOGLE
9     FILTERING   CONFERENCE  EUROPEAN    ACM
10    GRANT       ARIA        PAPERS      GRANT

U = 14, ∩ = 6
Strong indicator that TC can be used to estimate DF for web pages!
Google: screen scraping DF (?) values from the Google web interface
Experiment Results
[Figure: frequency of the TC/DF ratio within the WaC – at two decimals, one decimal, and integer resolution]
Experiment Results
Show similarity between WaC based TC and Google N-Gram based TC
[Figure: TC frequencies]
Note: N-Grams have a threshold of 200
Conclusions
• TC and DF ranks within the WaC show strong correlation
• TC frequencies of the WaC and Google N-Grams are very similar
• Together with the results shown earlier (high correlation between the baseline and the two other methods), N-Grams seem suitable for accurate IDF estimation for web pages
→ This does not mean that everything correlated to TC can be used as a DF substitute!
Inter-Search Engine Lexical Signature Performance
(JCDL 2009)
Martin Klein, Michael L. Nelson
{mklein,mln}@cs.odu.edu
[Figure: LSs for http://en.wikipedia.org/wiki/Elephant –
“Elephant Tusks Trunk African Loxodonta”,
“Elephant, African, Tusks, Asian, Trunk”,
“Elephant, Asian, African, Species, Trunk”]
Revisiting Lexical Signatures to (Re-)Discover Web Pages
(ECDL 2008)
How to Evaluate the Evolution of LSs over Time
Idea:
• Conduct an overlap analysis of LSs
• LSs based on the local universe mentioned above
• Neither Phelps and Wilensky nor Park et al. did that
  – Park et al. just re-confirmed their findings after 6 months
Dataset
[Figure] Local universe consisting of copies of URLs from the IA between 1996 and 2007
LSs Over Time - Example
10-term LSs generated for
http://www.perfect10wines.com
LS Overlap Analysis
Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URL has been observed.
Sliding: overlap between two LSs of consecutive years, starting with the first year and ending with the last.
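Both notions reduce to set overlap between yearly LSs; a minimal sketch assuming a dict that maps each observed year to that year's LS terms.

def rooted_overlap(ls_by_year):
    # overlap of every later year's LS with the LS of the first observed year
    years = sorted(ls_by_year)
    root = set(ls_by_year[years[0]])
    return {y: len(root & set(ls_by_year[y])) / len(root) for y in years[1:]}

def sliding_overlap(ls_by_year):
    # overlap between the LSs of consecutive observed years
    years = sorted(ls_by_year)
    return {b: len(set(ls_by_year[a]) & set(ls_by_year[b])) / len(set(ls_by_year[a]))
            for a, b in zip(years, years[1:])}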
Evolution of LSs over Time – Rooted
Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone they do not return
Evolution of LSs over Time – Sliding
Results:
• Overlap increases over time
• Seems to reach a steady state around 2003
Performance of LSs
Idea:
• Query the Google search API with LSs
• LSs based on the local universe mentioned above
• Identify the URL in the result set
• For each URL it is possible that:
  1. the URL is returned as the top ranked result
  2. the URL is ranked somewhere between 2 and 10
  3. the URL is ranked somewhere between 11 and 100
  4. the URL is ranked somewhere beyond rank 100 → considered as not returned
Performance of LSs wrt Number of Terms
Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  – top mean rank (MR) value with 5 terms
  – most top ranked with 7 terms
• Binary pattern: either in the top 10 or undiscovered
• 8 terms and beyond do not show improvement
Performance of LSs wrt Number of Terms
Rank distribution of 5-term LSs
• Lightest gray = rank 1
• Black = rank 101 and beyond
• Ranks 11-20, 21-30, … colored proportionally
• 50% top ranked, 20% in top 10, 30% black
Performance of LSs
Scoring (generalized from Park et al.; equation in Section 6.1 of the paper)
• Fair:
  – gives credit to all URLs equally, with linear spacing between ranks
• Optimistic:
  – bigger penalty for lower ranks
• Scores for the position of a URL in a list of 10:
  – Fair: 10/10, 9/10, 8/10 … 1/10, 0
  – Optimistic: 1/1, 1/2, 1/3 … 1/10, 0
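The two scores written out for the list-of-10 case; the paper's Section 6.1 equation generalizes these.

def fair_score(rank, cutoff=10):
    # linear credit: 10/10 for rank 1 down to 1/10 for rank 10, 0 beyond
    return (cutoff - rank + 1) / cutoff if 1 <= rank <= cutoff else 0.0

def optimistic_score(rank, cutoff=10):
    # reciprocal credit: 1/1, 1/2, ... 1/10, 0 beyond
    return 1.0 / rank if 1 <= rank <= cutoff else 0.0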
Performance of LSs wrt Number of Terms
[Figure: fair and optimistic scores for LSs consisting of 2-15 terms (mean values over all years)]
Performance of LSs over Time
[Figures: fair and optimistic scores for LSs consisting of 2, 5, 7 and 10 terms]
Conclusions
• LSs decay over time
  – rooted: quickly after generation
  – sliding: seems to stabilize
• 5-, 6- and 7-term LSs seem to perform best
  – 7 terms – most top ranked
  – 5 terms – fewest undiscovered
  – 5 terms – lowest mean rank
• 8 terms and beyond hurt performance
Evaluating Methods to Rediscover
Missing Web Pages from the
Web Infrastructure
(JCDL 2010)
The Problem
Internet Archive Wayback Machine
www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com
59 copies
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
The Problem
www.aircharter-international.com
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
[Screenshot: search engine results for the LS query]
The Problem
www.aircharter-international.com
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
[Screenshot: search engine results for the title query]
The Problem
If no archived/cached copy can be found...
• Link Neighborhood (LNLS)
• Tags
[Figure: missing page “?” and its link neighborhood pages A, B, C]
Contributions
• Compare performance of four automated methods to rediscover web pages:
  1. Lexical signatures (LSs)
  2. Titles
  3. Tags
  4. LNLS
• Analysis of title characteristics wrt their retrieval performance
• Evaluate performance of combinations of methods and suggest a workflow for real time web page rediscovery
Experiment – Data Gathering
• 500 URIs randomly sampled from DMOZ
• Applied filters
  – .com, .org, .net, .edu domains
  – English language
  – min. of 50 terms [Park]
• Results in 309 URIs to download and parse
Experiment – Data Gathering
• Extract title
  – <Title>...</Title>
• Generate 3 LSs per page
  – IDF values obtained from Google, Yahoo!, MSN Live
• Obtain tags from the delicious.com API (tags available for only 15%)
• Obtain link neighborhood from the Yahoo! API (max. 50 URIs)
  – generate LNLS
  – TF from a “bucket” of words per neighborhood
  – IDF obtained from the Yahoo! API
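Title extraction is by far the cheapest of the four methods; a bare-bones sketch (regex-based extraction is simplistic and error handling is omitted).

import re
import urllib.request

def page_title(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return re.sub(r"\s+", " ", m.group(1)).strip() if m else None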
LS Retrieval Performance
5- and 7-Term LSs
• Yahoo! returns the most URIs top ranked and leaves the fewest undiscovered
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
Title Retrieval Performance
Non-Quoted and Quoted Titles
• Results at least as good as for LSs
• Google and Yahoo! return more URIs for non-quoted titles
• Same binary retrieval pattern
Tags Retrieval Performance
• The API returns up to the top 10 tags; distinguish between the # of tags queried
• Low # of URIs
LNLS Retrieval Performance
• 5- and 7-term LNLSs
• < 5% top ranked
Combination of Methods
Can we achieve better retrieval performance if we combine 2 or more methods?
[Workflow: Query LS → Done? → Query Title → Done? → Query Tags → Done? → Query LNLS]
Retrieval performance per method (Top / Top 10 / Undiscovered, in %):

Google      Top    Top10  Undis
LS5         50.8   12.6   32.4
LS7         57.3    9.1   31.1
TI          69.3    8.1   19.7
TA           2.1   10.6   75.5

MSN Live    Top    Top10  Undis
LS5         63.1    8.1   27.2
LS7         62.8    5.8   29.8
TI          61.5    6.8   30.7
TA           0.0    8.5   80.9

Yahoo!      Top    Top10  Undis
LS5         67.6    7.8   22.3
LS7         66.7    4.5   26.9
TI          63.8    8.1   27.5
TA           6.4   17.0   63.8
Combination of Methods
Top results for combinations of methods (% of URIs returned top ranked):

              Google  Yahoo!  MSN Live
LS5-TI        65.0    73.8    71.5
LS7-TI        70.9    75.7    73.8
TI-LS5        73.5    75.7    73.1
TI-LS7        74.1    75.1    74.1
LS5-TI-LS7    65.4    73.8    72.5
LS7-TI-LS5    71.2    76.4    74.4
TI-LS5-LS7    73.8    75.7    74.1
TI-LS7-LS5    74.4    75.7    74.8
LS5-LS7       52.8    68.0    64.4
LS7-LS5       59.9    71.5    66.7
Title Characteristics
Length in # of Terms
• Length varies between 1 and 43 terms
• Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]
Title Characteristics
Length in # of Characters
• Length varies between 4 and 294 characters
• Short titles (<10 characters) do not perform well
• Length between 10 and 70 characters is most common
• Length between 10 and 45 characters seems to perform best
Title Characteristics
Mean # of Characters, # of Stop Words
• Title terms with a mean of 5, 6 or 7 characters seem most suitable for well performing titles
• More than 1 or 2 stop words hurts performance
Conclusions
Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% of URIs top ranked.
Tags and link neighborhood LSs do not seem to significantly contribute to the retrieval of the web pages.
Titles are much cheaper to obtain than LSs.
The combination of primarily querying titles and 5-term LSs as a second option returns more than 75% of URIs top ranked.
Not all titles are equally good. Titles containing between 3 and 6 terms seem to perform best. More than a couple of stop words hurt the performance.
Is This a Good Title?
(Hypertext 2010)
The Problem
Professional Scholarly Publishing 2003
http://www.pspcentral.org/events/annual_meeting_2003.html
The Problem
Internet Archive Wayback Machine
www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com
59 copies
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
[Screenshots: search engine results for the LS query and the title query]
The Problem
http://www.drbartell.com/
Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University → ???
Title: Thomas Bartell MD Board-Certified Cosmetic Plastic Reconstructive Surgery
The Problem
www.reagan.navy.mil
Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding
Title: Home Page → ???
Is This a Good Title?
Contributions
• Discuss discovery performance of web page titles (compared to LSs)
• Analysis of discovered pages regarding their relevancy
• Display title evolution compared to content evolution over time
• Provide a prediction model for a title’s retrieval potential
Experiment – Data Gathering
• 20k URIs randomly sampled from DMOZ
• Applied filters
  – English language
  – min. of 50 terms
• Results in 6,875 URIs
• Downloaded and parsed the pages
• Extract title and generate LS per page (baseline)

TLD    Original  Filtered
.com   15289     4863
.org   2755      1327
.net   1459      369
.edu   497       316
sum    20000     6875
Title (and LS) Retrieval Performance
[Figures: Titles; 5- and 7-Term LSs]
• Titles return more than 60% of URIs top ranked
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
Relevancy of Retrieval Results
Do titles return relevant results besides the original URI?
• Distinguish between discovered (top 10) and undiscovered URIs
• Analyze the content of the top 10 results
• Measure relevancy in terms of normalized term overlap and shingles between the original URI and each search result, by rank (a minimal shingling sketch follows)
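A minimal shingling sketch in the style of Broder's w-shingling; the shingle size and word tokenization are assumptions, not the paper's exact parameters.

def shingles(text, w=10):
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def resemblance(a, b, w=10):
    # Jaccard resemblance of the two texts' shingle sets
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0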
Relevancy of Retrieval Results
[Figure: term overlap for discovered vs. undiscovered URIs]
High relevancy in the top ranks, with possible aliases and duplicates.
Relevancy of Retrieval Results
[Figure: shingle values for discovered vs. undiscovered URIs]
Shingle values better than those of the top ranked URIs are possible – aliases and duplicates.
Title Evolution – Example I
www.sun.com/solutions
1998-01-27: Sun Software Products Selector Guides - Solutions Tree
1999-02-20: Sun Software Solutions
2002-02-01: Sun Microsystems Products
2002-06-01: Sun Microsystems - Business & Industry Solutions
2003-08-01: Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02: Sun Microsystems – Solutions
2004-06-10: Gateway Page - Sun Solutions
2006-01-09: Sun Microsystems Solutions & Services
2007-01-03: Services & Solutions
2007-02-07: Sun Services & Solutions
2008-01-19: Sun Solutions
Title Evolution – Example II
www.datacity.com/mainf.html
2000-06-19: DataCity of Manassas Park Main Page
2000-10-12: DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21: DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16: computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14: Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
Title Evolution Over Time
How much do titles change over time?
• Copies from fixed size time windows per year
• Extract available titles of the past 14 years
• Compute normalized Levenshtein edit distance between titles of copies and the baseline (0 = identical; 1 = completely dissimilar)
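The edit distance written out; normalizing by the longer title is an assumption consistent with the 0-to-1 scale above.

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_distance(title_a, title_b):
    # 0 = identical titles, 1 = completely dissimilar
    return levenshtein(title_a, title_b) / max(len(title_a), len(title_b), 1)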
Title Evolution Over Time
Title edit distance frequencies
• Half the titles of available copies from recent years are (close to) identical
• Decay from 2005 on (with fewer copies available)
• A 4 year old title has a 40% chance to be unchanged
Title Evolution Over Time
[Scatter plot: title vs. document evolution]
• Y: avg shingle value for all copies per URI
• X: avg edit distance of the corresponding titles
• Overlap indicated by color: green < 10, red > 90
• Semi-transparency shows the total amount of points plotted
• The point [0,1] occurs over 1600 times; [0,0] occurs 122 times
Title Performance Prediction
• Quality prediction of a title by
  – the number of nouns, articles etc.
  – the amount of title terms and characters [Ntoulas]
• Observation of re-occurring terms in poorly performing titles – “Stop Titles”:
  home, index, home page, welcome, untitled document
The performance of any given title can be predicted as insufficient if it consists to 75% or more of a “Stop Title”!
[Ntoulas] A. Ntoulas et al., “Detecting Spam Web Pages Through Content Analysis”, In Proceedings of WWW 2006, pp 83-92
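The rule lends itself to a direct check; reading “consists to 75% or more of a Stop Title” as the fraction of title terms covered by a single Stop Title is our interpretation, not necessarily the paper's exact formulation.

STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]

def predicted_insufficient(title, threshold=0.75):
    terms = title.lower().split()
    if not terms:
        return True
    for stop in STOP_TITLES:
        covered = sum(t in stop.split() for t in terms)  # terms matching this Stop Title
        if covered / len(terms) >= threshold:
            return True
    return False

print(predicted_insufficient("Home Page"))               # True
print(predicted_insufficient("Sun Software Solutions"))  # False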
Conclusions
The “aboutness” of web pages can be determined from either the content or from the title.
More than 60% of URIs are returned top ranked when using the title as a search engine query.
Titles change more slowly and less significantly over time than the web pages’ content.
Not all titles are equally good. If the majority of title terms form a Stop Title, the title’s quality can be predicted as poor.
Comparing the Performance of
US College Football Teams
in the Web and on the Field
(Hypertext 2009)
Naming Conventions
[Images: (American) football vs. soccer]
Motivation
• “Does Authority mean Quality?” [Amento00]
  – link-based web page metrics can be used to estimate experts’ assessment of quality
• Lists compiled by experts are cool!
  – companies, schools, people, places, etc.
• The “Big 3” search engines play a central role in our lives
  – “If I can’t find it in the top 10 it doesn’t exist in the web”
  – SEOs
• Do expert rankings of real-world entities correlate with search engine rankings of corresponding web resources?
Background
• Expert ranking of real-world entities: collegiate football programs in the US
• Associated Press (AP) poll
  – 65 sportswriters and broadcasters
• USA Today Coaches poll
  – 63 college football head coaches
• Published once a week; top 25 teams; 25-1 point system
• “Big 3” search engines: Google, Yahoo and MSN Live (APIs)
US College Football Season 2008
• The 2008 season began on August 28th 2008 and concluded January 8th 2009
• 18 instances of poll data:
  – final polls from the 2007 season (as a baseline)
  – 2008 pre-season polls
  – once for each of the 16 weeks of the 2008 season
Mapping Resources to URLs
• Often impossible to distill the canonical URL for a football program
• e.g. a query for Virginia Tech college football returned:
  – the official school page
  – commercial sports sites
  – Wikipedia
  – blogs, fan sites, etc.
Mapping Resources to URLs
• Query the 3 search engine APIs for representative URLs
  – query: schoolname+College+Football
  – e.g.: Ohio+State+College+Football
• Aggregate the top 8 representative URLs (n = 1 .. 8)
• Temporal aspect in mind:
  – repeat the query and renew the aggregation weekly
Ordinal Ranking of URLs from SE Queries
We are not interested in computing a search engine’s absolute ranking for a particular URL (PR values)
BUT
we are determining in which order a search engine ranks a set of URLs.
Ordinal Ranking of URLs from SE Queries
• Search engines enforce query restrictions (length, amount per day etc.)
• Build unbiased and overlapping queries
• site and OR operators
• Variation of strand sort
Example for USC, Georgia, Ohio State, Oklahoma, Florida:
site:http://usctrojans.cstv.com/sports/m-footbl/usc-m-footbl-body.html OR
site:http://uga.rivals.com/ OR
site:http://sportsillustrated.cnn.com/football/ncaa/teams/ohiost/ OR
site:http://www.soonersports.com/ OR
site:http://www.gatorzone.com/
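Building such a query is mechanical; a sketch (the strand-sort batching that keeps each query within the engines' restrictions is described in the paper).

def ordinal_query(urls):
    # ask the engine to rank a set of URLs against each other
    return " OR ".join("site:" + u for u in urls)

print(ordinal_query(["http://www.soonersports.com/",
                     "http://www.gatorzone.com/"]))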
Weighting Ranked URLs
• If real-world resources are mapped to more than one URL (n > 1):
  – need to accumulate a ranking score
  – determine one final overall school score
• Assign weights per URL depending on their rank
  – P – position of the URL in the result set
  – T – total number of URLs in the list (n * number of teams)
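A sketch of the accumulation; the linear weight (T - P + 1) / T fits the P and T defined above but is an assumption, not necessarily the paper's exact weighting function.

def url_weight(p, t):
    # p: position of the URL in the result set, t: total number of URLs
    return (t - p + 1) / t

def school_score(positions, t):
    # accumulate one overall score for a school from all its URLs' positions
    return sum(url_weight(p, t) for p in positions)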
Correlation Results
Kendall Tau used to test for statistically significant (p<0.05) correlation
[Figures: Top 10 AP Poll, Top 10 USA Today Poll]
Correlation Results
Kendall Tau used to test for statistically significant (p<0.05) correlation
“Inertia”
[Figures: Top 25 AP Poll, Top 25 USA Today Poll]
n-Values for Correlation
[Figures: Top 10 AP Poll, Top 10 USA Today Poll]
n-Values for Correlation
[Figures: Top 25 AP Poll, Top 25 USA Today Poll (n=2..6)]
Correlation of Overlapping URLs Over Time
• 12 schools occur in all AP polls throughout the season:
  USC, Florida, Alabama, Georgia, Ohio State, Oklahoma, Missouri, Texas, Texas Tech, BYU, Penn State, Utah
• Given the “inertia”, by how much does the web trail?
• Can we measure a “delayed correlation”?
• Declare the AP ranking for each week as separate “truth values”
• Compute correlation between truth values and search engine ranking
• Expect to see an increased correlation in the weeks following the truth value
Correlation of Overlapping URLs Over Time
[Figure: n=8]
Correlation Between Attendance and SE and Polls
[Figures: AP, USA Today, Google n=1, Google n=6]
Conclusions
• Inspired by “Does Authority mean Quality?” we asked “Does Quality mean Authority?”
• High correlations for the last season’s final rankings and rankings early in the season
• Correlation decreases because of “inertia”
• No correlation between attendance and search engine rankings

Although authority means quality, quality does not necessarily mean authority – at least not immediately.