Synchronicity
Time on the Web - Week 3
CS 895 Fall 2010
Martin Klein
mklein@cs.odu.edu
09/15/2010
The Problem
http://www.jcdl2007.org
http://www.jcdl2007.org/JCDL2007_Program.pdf
The Problem
• Web users experience 404 errors
  – the expected lifetime of a web page is 44 days [Kahle97]
  – 2% of the web disappears every week [Fetterly03]
• Are they really gone? Or just relocated?
  – has anybody crawled and indexed the page?
  – do Google, Yahoo!, Bing or the IA have a copy of it?
• Information retrieval techniques are needed to (re-)discover content
The Environment
Web Infrastructure (WI) [McCown07]
• Web search engines (Google, Yahoo!, Bing) and their caches
• Web archives (Internet Archive)
• Research projects (CiteSeer)
Refreshing and Migration in the WI
Digital preservation happens in the WI:
• Google Scholar
• CiteSeerX
• Internet Archive
URI – Content Mapping Problem
[Figure: four URI-to-content mapping cases over time]
1. Same URI maps to same or very similar content at a later time
   (U1 → C1 at time A, U1 → C1 at time B)
2. Same URI maps to different content at a later time
   (U1 → C1 at time A, U1 → C2 at time B)
3. Different URI maps to same or very similar content at the same or at a later time
   (U1 → C1 at time A; later U1 returns 404 while U2 → C1)
4. The content cannot be found at any URI
   (U1 → C1 at time A, U1 → ??? at time B)
Content Similarity
JCDL 2005 – http://www.jcdl2005.org/
[Screenshots: the page in July 2005 vs. today]
Content Similarity
Hypertext 2006 – http://www.ht06.org/
[Screenshots: the page in August 2006 vs. today]
Content Similarity
PSP 2003
August 2003: http://www.pspcentral.org/events/annual_meeting_2003.html
Today: http://www.pspcentral.org/events/archive/annual_meeting_2003.html
Content Similarity
ECDL 1999
October 1999: http://www-rocq.inria.fr/EuroDL99/
Today: http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html
Content Similarity
Greynet 1999
http://www.konbib.nl/infolev/greynet/2.5.htm
1999: ?   Today: ?   (the page cannot be found at any URI)
Lexical Signatures (LSs)
• First introduced by Phelps and Wilensky [Phelps00]
• Small set of terms capturing the “aboutness” of a document; “lightweight” metadata
[Figure: system sketch – resource, abstract, LS; removal and hit rate; Google, Yahoo; proxy, cache]
Generation of Lexical Signatures
• Following the TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]
• Term frequency (TF):
  – “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  – “In how many documents does this word appear?”
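As a concrete illustration, here is a minimal sketch of TF-IDF based LS generation; the tokenizer, the log-smoothed IDF variant, and the function name are assumptions for this example, not the exact formulation of the papers cited here.

import math
import re
from collections import Counter

def lexical_signature(text, df, n_docs, k=5):
    # df: term -> number of corpus documents containing the term
    # n_docs: total number of documents in the corpus
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)
    def score(t):
        # TF times a log-smoothed IDF (one of several common variants)
        return tf[t] * math.log(n_docs / (1 + df.get(t, 0)))
    return sorted(tf, key=score, reverse=True)[:k]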
LS as Proposed by Phelps and Wilensky
• “Robust Hyperlink”
• 5 terms are suitable
• Append the LS to the URL:
http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
• Limitations:
  1. Applications (browsers) need to be modified to exploit LSs
  2. LSs need to be computed a priori
  3. Works well with most URLs but not with all of them
Generation of Lexical Signatures
• Park et al. [Park03] investigated the performance of various LS generation algorithms
• Evaluated the “tunability” of the TF and IDF components:
  – weight on TF increases recall (completeness)
  – weight on IDF improves precision (exactness)
Lexical Signatures -- Examples
Rank/Results: 1/1
URL: http://www.cs.berkeley.edu/~wilensky/NLP.html
LS: texttiling wilensky disambiguation subtopic iago
Query: http://www.google.com/search?q=texttiling+wilensky+disambiguation+subtopic+iago

Rank/Results: na/10
URL: http://www.dli2.nsf.gov
LS: nsdl multiagency imls testbeds extramural
Query: http://www.google.com/search?q=nsdl+multiagency+imls+testbeds+extramural

Rank/Results: 1/221,000 (1/174,000 in 01/2008)
URL: http://www.loc.gov
LS: library collections congress thomas american
Query: http://www.google.com/search?q=library+collections+congress+thomas+american

Rank/Results: 1/51 (2/77 in 01/2008)
URL: http://www.jcdl2008.org
LS: libraries jcdl digital conference pst
Query: http://www.google.com/search?q=libraries+jcdl+digital+conference+pst
Synchronicity
404 error occurs while browsing:
look for same or older page in WI                  (1)
if user satisfied:
    return page                                    (2)
else:
    generate LS from retrieved page                (3)
    query SEs with LS
    if result sufficient:
        return “good enough” alternative page      (4)
    else:
        get more input about desired content       (5)
        (link neighborhood, user input, ...)
        re-generate LS && query SEs ...
        return pages                               (6)
The system may not return any results at all.
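The same workflow as a hedged Python sketch; wi, engines, and user are hypothetical stand-ins for the Web Infrastructure, the search engine APIs, and the browsing user, and make_ls is any LS generator such as the TF-IDF sketch earlier.

def synchronicity(url, wi, engines, user, make_ls):
    copy = wi.lookup(url)                              # (1) search the WI: caches, archives
    if copy is not None and user.satisfied(copy):
        return copy                                    # (2) same or older page suffices
    base = copy.text if copy is not None else ""
    ls = make_ls(base)                                 # (3) LS from the retrieved copy
    hits = [h for se in engines for h in se.query(" ".join(ls))]
    if hits and user.satisfied(hits[0]):
        return hits[0]                                 # (4) "good enough" alternative
    extra = user.more_context(url)                     # (5) link neighborhood, user input, ...
    ls = make_ls(base + " " + extra)                   # re-generate the LS
    return [h for se in engines for h in se.query(" ".join(ls))]  # (6) may be empty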
Synchro…What?
Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal an underlying pattern, a framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
• “meaningful coincidence”
• Deschamps – de Fontgibu plum pudding example
picture from http://www.crystalinks.com/jung.html
404 Errors
[Screenshots of 404 error pages]
“Soft 404” Errors
[Screenshots of “soft 404” pages]
A Comparison of Techniques for
Estimating IDF Values to Generate
Lexical Signatures for the Web
(WIDM 2008)
The Problem
• LSs are usually generated following the TF-IDF scheme
• TF is rather trivial to compute
• IDF requires knowledge about:
  – the overall size of the corpus (# of documents)
  – the # of documents a term occurs in
• Also not complicated to compute for bounded corpora (such as TREC)
• If the web is the corpus, the values can only be estimated
The Idea
• Use IDF values obtained from
  1. a local collection of web pages
  2. “screen scraping” SE result pages
• Validate both methods through comparison to a baseline
• Use Google N-Grams as the baseline
• Note: N-Grams provide term count (TC) and not DF values – details to come
Accurate IDF Values for LSs
[Screenshot: screen scraping the Google web interface for result counts]
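The arithmetic behind this is simple once a result count is available; a sketch assuming that reported hit counts approximate document frequencies and that the engine's count for a very common word approximates the index size – both assumptions, not the paper's calibrated method.

import math

def estimated_idf(term_hits, web_size_estimate):
    # term_hits: the engine's "about N results" count for the term
    # web_size_estimate: e.g. the engine's count for a stop word such as "the"
    return math.log(web_size_estimate / max(1, term_hits))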
The Dataset
[Figure] Local universe consisting of copies of URLs from the IA between 1996 and 2007
The Dataset
Same as above; the term frequencies follow a Zipf distribution
• 10,493 observations
• 254,384 total terms
• 16,791 unique terms
The Dataset
[Figure: total terms vs. new terms]
LSs Example
Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms
Comparing LSs
1. Normalized term overlap
   – assumes term commutativity
   – k-term LSs normalized by k
2. Kendall Tau
   – modified version, since the LSs to compare may contain different terms
3. M-Score
   – penalizes discordance in higher ranks
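Normalized term overlap is straightforward to state; a minimal sketch (the modified Kendall Tau and the M-Score are defined in the paper and omitted here).

def normalized_overlap(ls_a, ls_b):
    # overlap of two k-term LSs, order ignored, normalized by k
    assert len(ls_a) == len(ls_b)
    return len(set(ls_a) & set(ls_b)) / len(ls_a)

# e.g. two top-5 LSs sharing three terms -> 0.6
print(normalized_overlap(["charter", "aircraft", "cargo", "jet", "air"],
                         ["charter", "aircraft", "cargo", "passenger", "enquiry"]))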
Comparing LSs
[Figure: similarity scores for the top 5, 10 and 15 terms]
LC – local universe, SC – screen scraping, NG – N-Grams
Conclusions
• Both methods for the computation of IDF values provide accurate results
  – compared to the Google N-Gram baseline
• The screen scraping method seems preferable since
  – similarity scores are slightly higher
  – it is feasible in real time
Correlation of Term Count and Document
Frequency for Google N-Grams
(ECIR 2009)
The Problem
• Need a reliable source to accurately compute IDF values of web pages (in real time)
• Screen scraping was shown to work, but
  – validation of the baseline (Google N-Grams) is missing
• N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF
→ what is their relationship?
Background & Motivation
• Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept
• Used (among others) to generate lexical signatures (LSs)
• TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus
→ When the entire web is the corpus, IDF can only be estimated!
• Most text corpora provide term count (TC) values

D1 = “Please, Please Me”        D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”     D4 = “Long, Long, Long”

Term     TC  DF
All       1   1
Buy       1   1
Can’t     1   1
Is        1   1
Love      2   2
Me        2   2
Need      1   1
Please    2   1
You       1   1
Long      3   1

TC >= DF, but is there a correlation? Can we use TC to estimate DF?
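A sketch that computes both quantities for the four song-title “documents” above; the tokenization is an assumption.

from collections import Counter

docs = ["Please, Please Me", "Can't Buy Me Love",
        "All You Need Is Love", "Long, Long, Long"]

def terms(doc):
    return [t.strip(",").lower() for t in doc.split()]

tc = Counter(t for d in docs for t in terms(d))       # count every occurrence
df = Counter(t for d in docs for t in set(terms(d)))  # count each document once

assert all(tc[t] >= df[t] for t in tc)  # TC >= DF always holds
print(tc["long"], df["long"])           # 3 1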
The Idea
• Investigate the relationship between:
  – TC and DF within the Web as Corpus (WaC)
  – WaC based TC and Google N-Gram based TC
• TREC, BNC could be used but:
  – they are not free
  – TREC has been shown to be somewhat dated [Chiang05]
The Experiment
• Analyze the correlation of a list of terms ordered by their TC and DF rank by computing:
  – Spearman’s Rho
  – Kendall Tau
• Display the frequency of the TC/DF ratio for all terms
• Compare TC (WaC) and TC (N-Grams) frequencies
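Both coefficients are available off the shelf, e.g. in SciPy; the rank lists below are illustrative only, not the WaC data.

from scipy.stats import spearmanr, kendalltau

tc_ranks = [1, 2, 3, 4, 5, 6]   # terms ranked by TC (illustrative)
df_ranks = [1, 3, 2, 4, 6, 5]   # the same terms ranked by DF (illustrative)

rho, _ = spearmanr(tc_ranks, df_ranks)
tau, _ = kendalltau(tc_ranks, df_ranks)
print(rho, tau)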
Experiment Results
Investigate correlation between TC and DF within the “Web as Corpus” (WaC)
[Figure: rank similarity of all terms]
Experiment Results
Investigate correlation between TC and DF within the “Web as Corpus” (WaC)
[Figure: Spearman’s ρ and Kendall τ]
Experiment Results
Top 10 terms in decreasing order of their TF/IDF values, taken from http://ecir09.irit.fr

Rank  WaC-DF      WaC-TC      Google      N-Grams
1     IR          IR          IR          IR
2     RETRIEVAL   RETRIEVAL   RETRIEVAL   IRSG
3     IRSG        IRSG        IRSG        RETRIEVAL
4     BCS         IRIT        CONFERENCE  BCS
5     IRIT        BCS         BCS         EUROPEAN
6     CONFERENCE  2009        GRANT       CONFERENCE
7     GOOGLE      FILTERING   IRIT        IRIT
8     2009        GOOGLE      FILTERING   GOOGLE
9     FILTERING   CONFERENCE  EUROPEAN    ACM
10    GRANT       ARIA        PAPERS      GRANT

U = 14, ∩ = 6
Strong indicator that TC can be used to estimate DF for web pages!
Google: screen scraping DF (?) values from the Google web interface
Experiment Results
[Figure: frequency of the TC/DF ratio within the WaC – at two decimals, one decimal, and integer resolution]
Experiment Results
Show similarity between WaC based TC and Google N-Gram based TC
[Figure: TC frequencies]
Note: N-Grams have a threshold of 200
Conclusions
• TC and DF ranks within the WaC show strong correlation
• TC frequencies of the WaC and Google N-Grams are very similar
• Together with the results shown earlier (high correlation between the baseline and the two other methods), N-Grams seem suitable for accurate IDF estimation for web pages
→ This does not mean that everything correlated to TC can be used as a DF substitute!
Inter-Search Engine Lexical Signature Performance
(JCDL 2009)
Martin Klein, Michael L. Nelson
{mklein,mln}@cs.odu.edu
[Figure: LSs for http://en.wikipedia.org/wiki/Elephant –
“Elephant Tusks Trunk African Loxodonta”,
“Elephant, African, Tusks, Asian, Trunk”,
“Elephant, Asian, African, Species, Trunk”]
Revisiting Lexical Signatures to (Re-)Discover Web Pages
(ECDL 2008)
How to Evaluate the Evolution of LSs over Time
Idea:
• Conduct an overlap analysis of LSs
• LSs based on the local universe mentioned above
• Neither Phelps and Wilensky nor Park et al. did that
  – Park et al. just re-confirmed their findings after 6 months
Dataset
[Figure] Local universe consisting of copies of URLs from the IA between 1996 and 2007
LSs Over Time - Example
10-term LSs generated for
http://www.perfect10wines.com
LS Overlap Analysis
Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URL has been observed.
Sliding: overlap between two LSs of consecutive years, starting with the first year and ending with the last.
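Both notions reduce to set overlap between yearly LSs; a minimal sketch assuming a dict that maps each observed year to that year's LS terms.

def rooted_overlap(ls_by_year):
    # overlap of every later year's LS with the LS of the first observed year
    years = sorted(ls_by_year)
    root = set(ls_by_year[years[0]])
    return {y: len(root & set(ls_by_year[y])) / len(root) for y in years[1:]}

def sliding_overlap(ls_by_year):
    # overlap between the LSs of consecutive observed years
    years = sorted(ls_by_year)
    return {b: len(set(ls_by_year[a]) & set(ls_by_year[b])) / len(set(ls_by_year[a]))
            for a, b in zip(years, years[1:])}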
Evolution of LSs over Time – Rooted
Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone they do not return
Evolution of LSs over Time – Sliding
Results:
• Overlap increases over time
• Seems to reach a steady state around 2003
Performance of LSs
Idea:
• Query the Google search API with LSs
• LSs based on the local universe mentioned above
• Identify the URL in the result set
• For each URL it is possible that:
  1. the URL is returned as the top ranked result
  2. the URL is ranked somewhere between 2 and 10
  3. the URL is ranked somewhere between 11 and 100
  4. the URL is ranked somewhere beyond rank 100 → considered as not returned
Performance of LSs wrt Number of Terms
Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  – top mean rank (MR) value with 5 terms
  – most top ranked with 7 terms
• Binary pattern: either in the top 10 or undiscovered
• 8 terms and beyond do not show improvement
Performance of LSs wrt Number of Terms
Rank distribution of 5-term LSs
• Lightest gray = rank 1
• Black = rank 101 and beyond
• Ranks 11-20, 21-30, … colored proportionally
• 50% top ranked, 20% in top 10, 30% black
Performance of LSs
Scoring (generalized from Park et al.; equation in Section 6.1 of the paper)
• Fair:
  – gives credit to all URLs equally, with linear spacing between ranks
• Optimistic:
  – bigger penalty for lower ranks
• Scores for the position of a URL in a list of 10:
  – Fair: 10/10, 9/10, 8/10 … 1/10, 0
  – Optimistic: 1/1, 1/2, 1/3 … 1/10, 0
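The two scores written out for the list-of-10 case; the paper's Section 6.1 equation generalizes these.

def fair_score(rank, cutoff=10):
    # linear credit: 10/10 for rank 1 down to 1/10 for rank 10, 0 beyond
    return (cutoff - rank + 1) / cutoff if 1 <= rank <= cutoff else 0.0

def optimistic_score(rank, cutoff=10):
    # reciprocal credit: 1/1, 1/2, ... 1/10, 0 beyond
    return 1.0 / rank if 1 <= rank <= cutoff else 0.0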
Performance of LSs wrt Number of Terms
[Figure: fair and optimistic scores for LSs consisting of 2-15 terms (mean values over all years)]
Performance of LSs over Time
[Figures: fair and optimistic scores for LSs consisting of 2, 5, 7 and 10 terms]
Conclusions
• LSs decay over time
  – rooted: quickly after generation
  – sliding: seems to stabilize
• 5-, 6- and 7-term LSs seem to perform best
  – 7 terms – most top ranked
  – 5 terms – fewest undiscovered
  – 5 terms – lowest mean rank
• 8 terms and beyond hurt performance
Evaluating Methods to Rediscover
Missing Web Pages from the
Web Infrastructure
(JCDL 2010)
The Problem
Internet Archive Wayback Machine
www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com
59 copies
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
The Problem
www.aircharter-international.com
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
[Screenshot: search engine results for the LS query]
The Problem
www.aircharter-international.com
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
[Screenshot: search engine results for the title query]
The Problem
If no archived/cached copy can be found...
• Link Neighborhood (LNLS)
• Tags
[Figure: missing page “?” and its link neighborhood pages A, B, C]
Contributions
• Compare performance of four automated methods to rediscover web pages:
  1. Lexical signatures (LSs)
  2. Titles
  3. Tags
  4. LNLS
• Analysis of title characteristics wrt their retrieval performance
• Evaluate performance of combinations of methods and suggest a workflow for real time web page rediscovery
Experiment – Data Gathering
• 500 URIs randomly sampled from DMOZ
• Applied filters
  – .com, .org, .net, .edu domains
  – English language
  – min. of 50 terms [Park]
• Results in 309 URIs to download and parse
Experiment – Data Gathering
• Extract title
  – <Title>...</Title>
• Generate 3 LSs per page
  – IDF values obtained from Google, Yahoo!, MSN Live
• Obtain tags from the delicious.com API (tags available for only 15%)
• Obtain link neighborhood from the Yahoo! API (max. 50 URIs)
  – generate LNLS
  – TF from a “bucket” of words per neighborhood
  – IDF obtained from the Yahoo! API
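Title extraction is by far the cheapest of the four methods; a bare-bones sketch (regex-based extraction is simplistic and error handling is omitted).

import re
import urllib.request

def page_title(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return re.sub(r"\s+", " ", m.group(1)).strip() if m else None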
LS Retrieval Performance
5- and 7-Term LSs
• Yahoo! returns the most URIs top ranked and leaves the fewest undiscovered
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
Title Retrieval Performance
Non-Quoted and Quoted Titles
• Results at least as good as for LSs
• Google and Yahoo! return more URIs for non-quoted titles
• Same binary retrieval pattern
Tags Retrieval Performance
• The API returns up to the top 10 tags; distinguish between the # of tags queried
• Low # of URIs
LNLS Retrieval Performance
• 5- and 7-term LNLSs
• < 5% top ranked
Combination of Methods
Can we achieve better retrieval performance if we combine 2 or more methods?
[Workflow: Query LS → Done? → Query Title → Done? → Query Tags → Done? → Query LNLS]
Retrieval performance per method (Top / Top 10 / Undiscovered, in %):

Google      Top    Top10  Undis
LS5         50.8   12.6   32.4
LS7         57.3    9.1   31.1
TI          69.3    8.1   19.7
TA           2.1   10.6   75.5

MSN Live    Top    Top10  Undis
LS5         63.1    8.1   27.2
LS7         62.8    5.8   29.8
TI          61.5    6.8   30.7
TA           0.0    8.5   80.9

Yahoo!      Top    Top10  Undis
LS5         67.6    7.8   22.3
LS7         66.7    4.5   26.9
TI          63.8    8.1   27.5
TA           6.4   17.0   63.8
Combination of Methods
Top results for combinations of methods (% of URIs returned top ranked):

              Google  Yahoo!  MSN Live
LS5-TI        65.0    73.8    71.5
LS7-TI        70.9    75.7    73.8
TI-LS5        73.5    75.7    73.1
TI-LS7        74.1    75.1    74.1
LS5-TI-LS7    65.4    73.8    72.5
LS7-TI-LS5    71.2    76.4    74.4
TI-LS5-LS7    73.8    75.7    74.1
TI-LS7-LS5    74.4    75.7    74.8
LS5-LS7       52.8    68.0    64.4
LS7-LS5       59.9    71.5    66.7
Title Characteristics
Length in # of Terms
• Length varies between 1 and 43 terms
• Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]
Title Characteristics
Length in # of Characters
• Length varies between 4 and 294 characters
• Short titles (<10 characters) do not perform well
• Length between 10 and 70 characters is most common
• Length between 10 and 45 characters seems to perform best
Title Characteristics
Mean # of Characters, # of Stop Words
• Title terms with a mean of 5, 6 or 7 characters seem most suitable for well performing titles
• More than 1 or 2 stop words hurts performance
Conclusions
Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% of URIs top ranked.
Tags and link neighborhood LSs do not seem to significantly contribute to the retrieval of the web pages.
Titles are much cheaper to obtain than LSs.
The combination of primarily querying titles and 5-term LSs as a second option returns more than 75% of URIs top ranked.
Not all titles are equally good. Titles containing between 3 and 6 terms seem to perform best. More than a couple of stop words hurt the performance.
Is This a Good Title?
(Hypertext 2010)
The Problem
Professional Scholarly Publishing 2003
http://www.pspcentral.org/events/annual_meeting_2003.html
The Problem
Internet Archive Wayback Machine
www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com
59 copies
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
[Screenshots: search engine results for the LS query and the title query]
The Problem
http://www.drbartell.com/
Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University → ???
Title: Thomas Bartell MD Board-Certified Cosmetic Plastic Reconstructive Surgery
The Problem
www.reagan.navy.mil
Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding
Title: Home Page → ???
Is This a Good Title?
Contributions
• Discuss discovery performance of web page titles (compared to LSs)
• Analysis of discovered pages regarding their relevancy
• Display title evolution compared to content evolution over time
• Provide a prediction model for a title’s retrieval potential
Experiment – Data Gathering
• 20k URIs randomly sampled from DMOZ
• Applied filters
  – English language
  – min. of 50 terms
• Results in 6,875 URIs
• Downloaded and parsed the pages
• Extract title and generate LS per page (baseline)

TLD    Original  Filtered
.com   15289     4863
.org   2755      1327
.net   1459      369
.edu   497       316
sum    20000     6875
Title (and LS) Retrieval Performance
[Figures: Titles; 5- and 7-Term LSs]
• Titles return more than 60% of URIs top ranked
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
Relevancy of Retrieval Results
Do titles return relevant results besides the original URI?
• Distinguish between discovered (top 10) and undiscovered URIs
• Analyze the content of the top 10 results
• Measure relevancy in terms of normalized term overlap and shingles between the original URI and each search result, by rank (a minimal shingling sketch follows)
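A minimal shingling sketch in the style of Broder's w-shingling; the shingle size and word tokenization are assumptions, not the paper's exact parameters.

def shingles(text, w=10):
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def resemblance(a, b, w=10):
    # Jaccard resemblance of the two texts' shingle sets
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0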
Relevancy of Retrieval Results
[Figure: term overlap for discovered vs. undiscovered URIs]
High relevancy in the top ranks, with possible aliases and duplicates.
Relevancy of Retrieval Results
[Figure: shingle values for discovered vs. undiscovered URIs]
Shingle values better than those of the top ranked URIs are possible – aliases and duplicates.
Title Evolution – Example I
www.sun.com/solutions
1998-01-27: Sun Software Products Selector Guides - Solutions Tree
1999-02-20: Sun Software Solutions
2002-02-01: Sun Microsystems Products
2002-06-01: Sun Microsystems - Business & Industry Solutions
2003-08-01: Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02: Sun Microsystems – Solutions
2004-06-10: Gateway Page - Sun Solutions
2006-01-09: Sun Microsystems Solutions & Services
2007-01-03: Services & Solutions
2007-02-07: Sun Services & Solutions
2008-01-19: Sun Solutions
Title Evolution – Example II
www.datacity.com/mainf.html
2000-06-19: DataCity of Manassas Park Main Page
2000-10-12: DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21: DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16: computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14: Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
Title Evolution Over Time
How much do titles change over time?
• Copies from fixed size time windows per year
• Extract available titles of the past 14 years
• Compute normalized Levenshtein edit distance between titles of copies and the baseline (0 = identical; 1 = completely dissimilar)
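The edit distance written out; normalizing by the longer title is an assumption consistent with the 0-to-1 scale above.

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_distance(title_a, title_b):
    # 0 = identical titles, 1 = completely dissimilar
    return levenshtein(title_a, title_b) / max(len(title_a), len(title_b), 1)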
Title Evolution Over Time
Title edit distance frequencies
• Half the titles of available copies from recent years are (close to) identical
• Decay from 2005 on (with fewer copies available)
• A 4 year old title has a 40% chance to be unchanged
Title Evolution Over Time
[Scatter plot: title vs. document evolution]
• Y: avg shingle value for all copies per URI
• X: avg edit distance of the corresponding titles
• Overlap indicated by color: green < 10, red > 90
• Semi-transparency shows the total amount of points plotted
• The point [0,1] occurs over 1600 times; [0,0] occurs 122 times
Title Performance Prediction
• Quality prediction of a title by
  – the number of nouns, articles etc.
  – the amount of title terms and characters [Ntoulas]
• Observation of re-occurring terms in poorly performing titles – “Stop Titles”:
  home, index, home page, welcome, untitled document
The performance of any given title can be predicted as insufficient if it consists to 75% or more of a “Stop Title”!
[Ntoulas] A. Ntoulas et al., “Detecting Spam Web Pages Through Content Analysis”, In Proceedings of WWW 2006, pp 83-92
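The rule lends itself to a direct check; reading “consists to 75% or more of a Stop Title” as the fraction of title terms covered by a single Stop Title is our interpretation, not necessarily the paper's exact formulation.

STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]

def predicted_insufficient(title, threshold=0.75):
    terms = title.lower().split()
    if not terms:
        return True
    for stop in STOP_TITLES:
        covered = sum(t in stop.split() for t in terms)  # terms matching this Stop Title
        if covered / len(terms) >= threshold:
            return True
    return False

print(predicted_insufficient("Home Page"))               # True
print(predicted_insufficient("Sun Software Solutions"))  # False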
Conclusions
The “aboutness” of web pages can be determined from either the content or from the title.
More than 60% of URIs are returned top ranked when using the title as a search engine query.
Titles change more slowly and less significantly over time than the web pages’ content.
Not all titles are equally good. If the majority of title terms form a Stop Title, the title’s quality can be predicted as poor.
Comparing the Performance of
US College Football Teams
in the Web and on the Field
(Hypertext 2009)
Naming Conventions
[Images: (American) football vs. soccer]
Motivation
• “Does Authority mean Quality?” [Amento00]
  – link-based web page metrics can be used to estimate experts’ assessment of quality
• Lists compiled by experts are cool!
  – companies, schools, people, places, etc.
• The “Big 3” search engines play a central role in our lives
  – “If I can’t find it in the top 10 it doesn’t exist in the web”
  – SEOs
• Do expert rankings of real-world entities correlate with search engine rankings of corresponding web resources?
Background
• Expert ranking of real-world entities: collegiate football programs in the US
• Associated Press (AP) poll
  – 65 sportswriters and broadcasters
• USA Today Coaches poll
  – 63 college football head coaches
• Published once a week; top 25 teams; 25-1 point system
• “Big 3” search engines: Google, Yahoo and MSN Live (APIs)
US College Football Season 2008
• The 2008 season began on August 28th 2008 and concluded January 8th 2009
• 18 instances of poll data:
  – final polls from the 2007 season (as a baseline)
  – 2008 pre-season polls
  – once for each of the 16 weeks of the 2008 season
Mapping Resources to URLs
• Often impossible to distill the canonical URL for a football program
• e.g. a query for Virginia Tech college football returned:
  – the official school page
  – commercial sports sites
  – Wikipedia
  – blogs, fan sites, etc.
Mapping Resources to URLs
• Query the 3 search engine APIs for representative URLs
  – query: schoolname+College+Football
  – e.g.: Ohio+State+College+Football
• Aggregate the top 8 representative URLs (n = 1 .. 8)
• Temporal aspect in mind:
  – repeat the query and renew the aggregation weekly
Ordinal Ranking of URLs from SE Queries
We are not interested in computing a search engine’s absolute ranking for a particular URL (PR values)
BUT
we are determining in which order a search engine ranks a set of URLs.
Ordinal Ranking of URLs from SE Queries
• Search engines enforce query restrictions (length, amount per day etc.)
• Build unbiased and overlapping queries
• site and OR operators
• Variation of strand sort
Example for USC, Georgia, Ohio State, Oklahoma, Florida:
site:http://usctrojans.cstv.com/sports/m-footbl/usc-m-footbl-body.html OR
site:http://uga.rivals.com/ OR
site:http://sportsillustrated.cnn.com/football/ncaa/teams/ohiost/ OR
site:http://www.soonersports.com/ OR
site:http://www.gatorzone.com/
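Building such a query is mechanical; a sketch (the strand-sort batching that keeps each query within the engines' restrictions is described in the paper).

def ordinal_query(urls):
    # ask the engine to rank a set of URLs against each other
    return " OR ".join("site:" + u for u in urls)

print(ordinal_query(["http://www.soonersports.com/",
                     "http://www.gatorzone.com/"]))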
Weighting Ranked URLs
• If real-world resources are mapped to more than one URL (n > 1):
  – need to accumulate a ranking score
  – determine one final overall school score
• Assign weights per URL depending on their rank
  – P – position of the URL in the result set
  – T – total number of URLs in the list (n * number of teams)
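A sketch of the accumulation; the linear weight (T - P + 1) / T fits the P and T defined above but is an assumption, not necessarily the paper's exact weighting function.

def url_weight(p, t):
    # p: position of the URL in the result set, t: total number of URLs
    return (t - p + 1) / t

def school_score(positions, t):
    # accumulate one overall score for a school from all its URLs' positions
    return sum(url_weight(p, t) for p in positions)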
Correlation Results
Kendall Tau used to test for statistically significant (p<0.05) correlation
[Figures: Top 10 AP Poll, Top 10 USA Today Poll]
Correlation Results
Kendall Tau used to test for statistically significant (p<0.05) correlation
“Inertia”
[Figures: Top 25 AP Poll, Top 25 USA Today Poll]
n-Values for Correlation
[Figures: Top 10 AP Poll, Top 10 USA Today Poll]
n-Values for Correlation
[Figures: Top 25 AP Poll, Top 25 USA Today Poll (n=2..6)]
Correlation of Overlapping URLs Over Time
• 12 schools occur in all AP polls throughout the season:
  USC, Florida, Alabama, Georgia, Ohio State, Oklahoma, Missouri, Texas, Texas Tech, BYU, Penn State, Utah
• Given the “inertia”, by how much does the web trail?
• Can we measure a “delayed correlation”?
• Declare the AP ranking for each week as separate “truth values”
• Compute correlation between truth values and search engine ranking
• Expect to see an increased correlation in the weeks following the truth value
Correlation of Overlapping URLs Over Time
[Figure: n=8]
Correlation Between Attendance and SE and Polls
[Figures: AP, USA Today, Google n=1, Google n=6]
Conclusions
• Inspired by “Does Authority mean Quality?” we asked “Does Quality mean Authority?”
• High correlations for the last season’s final rankings and rankings early in the season
• Correlation decreases because of “inertia”
• No correlation between attendance and search engine rankings

Although authority means quality, quality does not necessarily mean authority – at least not immediately.