CS 572: Characterizing the Web

Hypersearching the Web,
Chakrabarti, Soumen
Presented By
Ray Yamada
Overview
• Why Do We Care?
• Purpose of The Paper?
• Solution by Clever Project
• Pros / Cons of the Paper
• Further Research
Why Do We Care?
• Web Link Analysis is crucial for efficient Crawling and
Ranking algorithms
• Crawling: Google Sitemap Submission, Yahoo Directory
• Ranking: Relevant Result
Purpose of The Paper?
• To Overcome These Challenges:
– Its Size & Growth
– Its Content Types
– Language Semantics
– New Language
– Staleness of Results
– SPAM
– And More…
Solution: Hyperlinks, Hyperlinks, Hyperlinks…
• Can Think of the Web as
a Directed Graph
• Node = Web page (URL)
• Edge = Hyperlink
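A minimal sketch of the directed-graph view, assuming a made-up four-page web. The page names and the helper `in_links` are hypothetical, chosen only for illustration:

```python
# Hypothetical toy web: each key is a page (node), each value is the list
# of pages it links to (outgoing edges).
web_graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}


def in_links(graph):
    """Invert the graph: map each page to the pages that link to it."""
    incoming = {page: [] for page in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


incoming = in_links(web_graph)
print(incoming["c.example"])  # pages that point to c.example
```

Link analysis mostly asks questions about the inverted view: who points to a page, not only where the page points.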
Solution: HITS Algorithm
• Hyperlink-Induced Topic Search (HITS)
– A.k.a. Hubs and Authorities
• Hubs – Highly-valued lists for a given query
– Ex. Yahoo Directory, Open Directory Project and
Bookmarking sites.
• Authorities – Highly endorsed answers to the query
– Ex. New York Times, Huffington Post, Twitter
• It is possible for a webpage to be both Hub and Authority
– Ex. Restaurant Review Blogs
Solution: HITS Algorithm Cont…
• For each page p, we assign it two values hub(p) and auth(p)
• Initial Value: For all p, hub(p) = 1, auth(p) = 1 (or any
predetermined number)
• Authority Update Rule: For each page p, update auth(p) to be the
sum of the hub scores of all pages that point to it.
• Hub Update Rule: For each page p, update hub(p) to be the sum of
the authority scores of all pages that it points to.
• Normalize and Repeat
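The update rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the Clever Project's implementation: the function names, the toy graph, and the fixed iteration count are assumptions, and scores are normalized to sum to 1 (Kleinberg's paper normalizes by the sum of squares instead).

```python
def normalize(scores):
    """Scale scores so they sum to 1 (assumed normalization)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {page: value / total for page, value in scores.items()}


def hits(graph, iterations=20):
    """Run HITS on a dict mapping page -> list of pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub update: sum the authority scores of all pages p points to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: two list pages pointing at two answer pages.
toy_graph = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(toy_graph)
print(sorted(auth.items(), key=lambda item: -item[1]))
```

In the toy graph, a1 is endorsed by both list pages and ends up with the highest authority score, while h1 links to more authorities and ends up with the higher hub score.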
Solution: HITS Algorithm Cont…
Calculation

Hub(p)    Num of Links    Raw Score
0.249     3               0.747
0.321     4               1.284
0.181     2               0.362
0.123     2               0.246
0.088     2               0.176
0.015     1               0.015
0.018     2               0.036
0.003     1               0.003
0.003     1               0.003
Sum: 1.00                 2.872

Authority Pages (q)    Raw Score    Auth(q)
SJ Merc News           0.57         0.198
Wall St. Journal       0.57         0.198
New York Times         0.874        0.304
USA Today              0.59         0.205
Facebook               0.123        0.043
Yahoo!                 0.121        0.042
Amazon                 0.024        0.008
Sum:                                1.000
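The normalization step in the authority table can be checked directly: each Auth(q) value is the page's raw score divided by the raw-score total of 2.872. A short sketch using the slide's own numbers (the variable names are illustrative):

```python
# Raw authority scores copied from the table.
raw_auth = {
    "SJ Merc News": 0.57,
    "Wall St. Journal": 0.57,
    "New York Times": 0.874,
    "USA Today": 0.59,
    "Facebook": 0.123,
    "Yahoo!": 0.121,
    "Amazon": 0.024,
}

total = sum(raw_auth.values())  # matches the 2.872 total in the hub table

# Divide each raw score by the total so the normalized scores sum to 1.
auth = {page: score / total for page, score in raw_auth.items()}

for page, score in auth.items():
    print(f"{page:18s} {score:.3f}")
```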
Pros:
– Accurately addresses concerns
and challenges we currently
deal with
– Great introduction to search
engine algorithm
– Briefly covered many topics
(Breadth)
Cons:
– Some materials are out of date
(1999)
– Ex. Google vs. Clever Project
– Lack of Depth
– Ex. Normalization of Hub
and Auth values
Further Research: HITS Algorithm – Extreme Cases
• Large-in-small-out sites
– High Auth(p)
– No Problem
• Small-in-large-out sites
– High Hub(p)
– Problem
Further Research: HITS + Relevance Scoring Method
• Vector Space Model (VSM)
– Documents and queries are represented by vectors
– Term Frequency
• Okapi Measurement
– Term Frequency + Document Length
• Cover Density Ranking (CDR)
– Phrase Similarity (How close terms appear)
Further Research: HITS + Relevance Scoring Method
• Use Cosine Relevance Test
[Figure: vector diagram with term axes "Price" and "Car"]
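A minimal sketch of a cosine relevance test in the vector space model, assuming two-term vectors over "car" and "price". The term counts and document names are made up for illustration:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as not relevant
    return dot / (norm_u * norm_v)


# Vector components are term counts in the order ("car", "price").
query = [1, 1]
doc_a = [3, 2]  # hypothetical page that mentions both terms
doc_b = [0, 4]  # hypothetical page that mentions only "price"

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a > sim_b)  # doc_a points in a direction closer to the query
```

Because cosine compares direction and not length, a long page is not favored merely for repeating one query term many times.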
Further Research: HITS + Relevance Scoring Method
• Three-Level Scoring Method (TLS)
– Manual Evaluation of Relevance
• Relevant Links = 2 points
• Slightly Relevant Links = 1 point
• Inactive Links + Error Links (404, 603) = 0 points
• Irrelevant Links = 0 points
– Order of query terms matters
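A sketch of how the point values could be tallied for one ranked result list. The label names, the example judgments, and the comparison against a maximum score are assumptions for illustration; the slide does not state the exact TLS formula.

```python
# Point values from the slide; the label strings are hypothetical.
POINTS = {
    "relevant": 2,
    "slightly_relevant": 1,
    "inactive": 0,
    "error": 0,
    "irrelevant": 0,
}


def tls_points(judgments):
    """Sum the points for a list of manual relevance judgments."""
    return sum(POINTS[label] for label in judgments)


# Made-up manual judgments for the top five links of one query.
top_results = ["relevant", "slightly_relevant", "irrelevant", "relevant", "error"]

score = tls_points(top_results)
max_score = 2 * len(top_results)  # every link judged relevant
print(score, max_score)
```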
Further Research: Co-citation Graph
• Regular Link Graph:
• Co-citation Graph:
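A minimal sketch of deriving a co-citation graph from a regular link graph, assuming the usual definition: two pages are co-cited when the same page links to both, and the edge weight is the number of pages that do so. The toy graph and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(graph):
    """Count, for each pair of pages, how many pages link to both."""
    counts = Counter()
    for targets in graph.values():
        # Every pair of distinct targets on one page is co-cited once.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


# Regular link graph: page -> pages it links to (made-up data).
link_graph = {
    "p1": ["x", "y"],
    "p2": ["x", "y", "z"],
    "p3": ["z"],
}

cocited = cocitation_counts(link_graph)
print(cocited[("x", "y")])  # x and y are linked together by p1 and p2
```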
What’s Next?
• Google’s New Search Index: Caffeine
– Announced June 8th, 2010
– Up to 50% fresher results
– Twice as fast
– http://googleblog.blogspot.com/2010/06/our-new-search-indexcaffeine.html
• Real Time Search
– Twitter / Facebook
References
• Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar;
Rajagopalan, Sridhar & Tomkins, Andrew (1999). "Hypersearching the Web".
Scientific American, June 1999.
• Li, Longzhuang; Shang, Yi & Zhang, Wei (2002). "Improvement of HITS-based
algorithms on web documents". Proceedings of the 11th International Conference
on World Wide Web, May 7-11, 2002, Honolulu, Hawaii, USA.
doi:10.1145/511446.511514.
• Henzinger, M. (2001). "Hyperlink analysis for the Web". IEEE Internet
Computing, 5(1), 45-50.
• Kleinberg, Jon (1999). "Authoritative sources in a hyperlinked
environment" (PDF). Journal of the ACM 46 (5): 604–632.
doi:10.1145/324133.324140.
• von Ahn, Luis (2008-10-19). "Hubs and Authorities" (PDF). 15-396: Science of
the Web Course Notes. Carnegie Mellon University. Retrieved 2008-11-09.
Q&A