Recognizing Communities on the Web
CS349 Presentation by
Audrey Kao
Recognizing Nepotistic Links on the Web, Brian D. Davison.
Self-Organization of the Web and Identification of Communities, Gary William
Flake, Steve Lawrence, C. Lee Giles, Frans M. Coetzee.
Introduction

How do links determine web communities?

Natural community formation vs. web authors manipulating nepotistic links

Graph-theoretic approach vs. machine learning approach

Both papers are fairly dated (Davison's from 2000, Flake et al.'s from 2002)
What is a Web Community?
A collection of web pages where each member page has more links within
the community than outside the community.
Goal: To identify web communities. Why?
For practical applications and web analysis
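A minimal sketch of the community definition on this slide, assuming links are given as (source, target) pairs and that inbound and outbound links count equally. The function name and the toy graph are hypothetical, not taken from either paper.

```python
def is_community(members, links):
    """Return True if every member has more links inside the set than outside.

    links is an iterable of (source, target) pairs. Direction is ignored
    when counting, which is an assumption made for this sketch.
    """
    members = set(members)
    for page in members:
        inside = outside = 0
        for u, v in links:
            if page not in (u, v):
                continue  # this link does not touch the page
            other = v if u == page else u
            if other in members:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True


# Toy graph: a, b, c link to each other; c also links out to x.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "x"), ("x", "y")]
print(is_community({"a", "b", "c"}, links))  # True
print(is_community({"c", "x"}, links))       # False
```

The second call fails because c has two links leaving the set {c, x} and only one link inside it.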
Maximum Flow Communities
• Given a directed graph G = (V, E) with edge
capacities c(u, v) ∈ Z+ and two vertices s, t ∈ V, find
the maximum flow that can be routed from the
source s to the sink t while obeying all capacity
constraints
• The Max-Flow Min-Cut theorem states that
the value of the maximum flow = the capacity of the minimum cut that separates s and t
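The max-flow problem and the theorem above can be illustrated with a short sketch. It uses the Edmonds-Karp algorithm (shortest augmenting paths found by breadth-first search), which is a choice made here for brevity; the slides do not say which max-flow algorithm the paper used. The function name and the example capacities are hypothetical.

```python
from collections import defaultdict, deque


def max_flow_min_cut(capacity, s, t):
    """Return (max flow value, vertices on the source side of a minimum cut).

    capacity maps (u, v) edge tuples to positive integer capacities.
    """
    # residual[u][v] is the remaining capacity on the edge u -> v
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), c in capacity.items():
        residual[u][v] += c
        residual[v][u] += 0  # make sure the reverse edge exists

    def reachable_from_source():
        # Breadth-first search over edges with spare capacity
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if t not in parent:
            # No augmenting path is left: the vertices still reachable
            # from s form the source side of a minimum cut.
            return flow, set(parent)
        # Find the bottleneck capacity along the path s -> ... -> t
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
            ("a", "t"): 1, ("b", "t"): 3}
flow, source_side = max_flow_min_cut(capacity, "s", "t")
print(flow, sorted(source_side))  # 4 ['a', 's']
```

In this example the edges leaving {s, a} are s→b (2), a→b (1), and a→t (1), so the cut capacity is 4, matching the flow value as the theorem requires.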
Exact vs. Approximate Flow Communities

• Exact: The “sink” is artificial and generic, i.e., it receives an edge
from every other vertex
  • Accepts any bi-directional link
  • The community is densely connected internally, but isolated from the rest of the
  graph
• Approximate: Determined by a fixed-depth crawl
  • Uses the exact-flow-community algorithm, then chooses the highest-ranked sites
  and repeats the algorithm
  • Rank is determined by the number of edges a site has within the community
  • This model was used for the study, as it better represents the actual web
Score is determined by the total number of inbound and outbound links a page has to other
pages in its community…
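The scoring rule just described can be sketched as follows, assuming links are (source, target) pairs and that self-links are ignored (an assumption of this sketch). The function name and page labels are hypothetical.

```python
def community_scores(members, links):
    """Score each member by its inbound plus outbound links to other members."""
    members = set(members)
    scores = {page: 0 for page in members}
    for u, v in links:
        if u in members and v in members and u != v:
            scores[u] += 1  # outbound link that stays inside the community
            scores[v] += 1  # inbound link from inside the community
    # Highest score first; ties are broken alphabetically
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


links = [("a", "b"), ("b", "a"), ("c", "a"), ("c", "b"), ("a", "outside")]
print(community_scores({"a", "b", "c"}, links))
# [('a', 3), ('b', 3), ('c', 2)]
```

The link from a to "outside" does not count, because only links to other pages in the community contribute to the score.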
Sample Results
Francis Crick Community
80 Biography of Francis Harry Compton Crick (Nobel Foundation)
79 Biography of James Dewey Watson (Nobel Foundation)
51 The Nobel Prize in Physiology or Medicine 1962 (Nobel Foundation)
50 Biographical Sketch of James Dewey Watson (Cold Spring Harbor Lab.)
41 A structure for Deoxyribose Nucleic Acid (Nature, April 2, 1953)
...
1 Felix D’Herelle and the Origins of Molecular Biology (Amazon.com)
1 Biography of Gregor Mendel
1 Magazine: HMS Beagle Home
1 The Alfred Russel Wallace Page
1 U.S. Human Genome Project 5 Year Plan
Most significant text features (Francis Crick community):
crick, nobel, dna, “francis crick”, “the nobel”, “of dna”, watson, “james watson”,
francis, molecular, biology, genetics, “watson and”, “structure of”, “crick and”
Stephen Hawking Community
85 Professor Stephen W. Hawking’s web pages
46 Stephen Hawking’s Universe at PBS
17 The Stephen Hawking Pages
15 Stephen Hawking Builds Robotic Exoskeleton (parody at the Onion)
14 Stephen Hawking and Intel
...
1 Did the cosmos arise from nothing? MSNBC story
1 Spanish page for Stephen Hawking’s Universe
1 Relativity Group at DAMTP, Cambridge
1 Millennium Mathematics Project
1 Particle Physics Education and Information Sites
Most significant text features (Stephen Hawking community):
hawking, “stephen hawking”, stephen, “hawking s”, “s universe”, physics,
“black holes”, “the universe”, cambridge, cosmology, einstein, relativity, damtp,
“universe the”
Ronald Rivest Community
86 Ronald L. Rivest : Home Page
29 Chaffing and Winnowing: Confidentiality without Encryption
20 Thomas H. Cormen’s home page at Dartmouth
9 The Mathematical Guts of RSA Encryption
8 German news story on Cryptography
...
1 Phil Zimmermann’s PGP web page
1 A Very Brief History of Computer Science
1 Cormen / Leiserson / Rivest: Introduction to Algorithms
1 Security and Encryption Links
1 HotBot Directory: Computers & Internet, Computer Science, People: R
Most significant text features (Ronald Rivest community):
rivest, “l rivest”, “ronald l”, ronald, cryptography, rsa, “ron rivest”, lcs, “theory lcs”,
encryption, “lcs mit”, theory, chaffing, winnowing, crypto
Results, cont’d

Communities are strongly topically related, as measured by simple binary text classifiers

The study used three-term binary classifiers (a page matches if it contains any of the three terms) to characterize communities, for example:
crick or nobel or darwin (matches 54% of the Francis Crick community, but only
0.5% of random web pages)
hawking or relativity or “for mathematical” (matches 84% of the Stephen Hawking community,
but only 0.2% of random pages)
Breadth-first crawling strategies do not yield topically relevant pages (only
10% of pages at a depth of two matched classification rules)
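A three-term rule of this kind can be sketched as a disjunction over word and two-word features. The tokenization below (lowercase alphanumeric runs, plus adjacent-word pairs) is an assumption chosen so that features like “hawking s” arise naturally from "Hawking's"; the paper's exact feature extraction is not given in the slides. The sample page texts are made up.

```python
import re


def text_features(text):
    """Return the set of single words and adjacent word pairs in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = {" ".join(pair) for pair in zip(words, words[1:])}
    return set(words) | bigrams


def matches_rule(text, terms):
    """A page matches if it contains any one of the rule's terms."""
    features = text_features(text)
    return any(term in features for term in terms)


def match_rate(pages, terms):
    """Fraction of pages that match the rule."""
    if not pages:
        return 0.0
    return sum(matches_rule(page, terms) for page in pages) / len(pages)


rule = ("hawking", "relativity", "for mathematical")
pages = [
    "Stephen Hawking's Universe",
    "Department for Mathematical Sciences",
    "Cricket scores and fixtures",
]
print([matches_rule(page, rule) for page in pages])  # [True, True, False]
```

Running `match_rate` once over a community's pages and once over a random sample gives the kind of comparison quoted above (for example 84% vs. 0.2%).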
What are Nepotistic Links?

Nepotistic links: links between pages that
are present for reasons other than merit
  • Links between sites under the same administrative
  control, like About.com
  • Advertising/paid links
Note: different from duplicate pages or
mirrored sites
Eba6.com
Mapquesy.com
Preliminary Experiments

Two data sets were used:
1. 1536 arbitrarily selected, manually labeled links
2. 750 random links from the DiscoWeb search engine’s 7 million pages, also
manually labeled as either nepotistic or not
75 binary features were used, including:
• Identical page titles or descriptions?
• Page descriptions overlap in at least some percentage of the text?
• Identical complete host names?
• Some number of initial IP address octets identical?
• Pages share at least some percentage of outgoing links?
• Domains have the same contact email address?
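A few of the binary features listed above can be sketched as follows. The page record layout (`url`, `title`, `ip`, `outlinks`), the function names, and the thresholds (3 octets, 50% overlap) are all assumptions for illustration; the slide only says "some number" and "some percentage". The example hosts and addresses are placeholders.

```python
from urllib.parse import urlsplit


def shared_ip_octets(ip_a, ip_b):
    """Count how many leading octets two dotted IPv4 addresses share."""
    count = 0
    for a, b in zip(ip_a.split("."), ip_b.split(".")):
        if a != b:
            break
        count += 1
    return count


def outlink_overlap(out_a, out_b):
    """Share of the smaller outgoing-link set that also appears in the other."""
    a, b = set(out_a), set(out_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def link_features(src, dst, octets=3, min_overlap=0.5):
    """Binary features for the link src -> dst (thresholds are assumed)."""
    return {
        "same_title": src["title"] == dst["title"],
        "same_host": urlsplit(src["url"]).hostname == urlsplit(dst["url"]).hostname,
        "ip_prefix_match": shared_ip_octets(src["ip"], dst["ip"]) >= octets,
        "outlinks_overlap": outlink_overlap(src["outlinks"], dst["outlinks"]) >= min_overlap,
    }


src = {"url": "http://www.example.com/a.html", "title": "Example",
       "ip": "192.0.2.10",
       "outlinks": ["http://x.example/1", "http://x.example/2"]}
dst = {"url": "http://shop.example.com/", "title": "Example",
       "ip": "192.0.2.77", "outlinks": ["http://x.example/1"]}
print(link_features(src, dst))
```

Here the two pages share a title, the first three IP octets, and an outgoing link, but not a complete host name, so three of the four features are true.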
Machine Learning

The C4.5 decision tree package was used to learn a classifier from the binary features
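C4.5 chooses which feature to split on using the gain ratio (information gain divided by the split information). The sketch below shows only that criterion, not the full C4.5 package, and the four labeled links are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """Information gain of splitting on `feature`, normalized by split info."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Expected entropy of the labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Made-up training links: label True means "nepotistic"
rows = [
    {"same_host": True, "same_title": True},
    {"same_host": True, "same_title": False},
    {"same_host": False, "same_title": True},
    {"same_host": False, "same_title": False},
]
labels = [True, True, False, False]
for feature in ("same_host", "same_title"):
    print(feature, gain_ratio(rows, labels, feature))
```

In this toy data `same_host` separates the labels perfectly (gain ratio 1.0) while `same_title` carries no information (gain ratio 0.0), so a tree learner would split on `same_host` first.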
Results
Results, cont’d

Can classify links with more accuracy if one uses already categorized
search engine results as “training data”

Second set of data too small –
does not represent the variety of sites on the web

Nepotistic links largely do not affect popular pages
Conclusions

Both experiments focused on binary classifiers

Naïve researchers: the scale of the web is too large to run any of these algorithms
on it; both used small sample sizes to begin with