Recognizing Communities on the Web
CS349 Presentation by Audrey Kao

• Recognizing Nepotistic Links on the Web, Brian D. Davison.
• Self-Organization of the Web and Identification of Communities, Gary William Flake, Steve Lawrence, C. Lee Giles, Frans M. Coetzee.

Introduction
• How do links determine web communities?
• Natural community formation vs. web authors manipulating nepotistic links
• Theoretical graph theory vs. artificial learning program
• Both papers are fairly dated, from 2002

What is a Web Community?
• A collection of web pages where each member page has more links within the community than outside the community.
• Goal: To identify web communities.
• Why? For practical applications and web analysis

Maximum Flow Communities
• Given a directed graph G = (V, E), with edge capacities c(u, v) ∈ Z+, and two vertices s, t ∈ V, find the maximum flow that can be routed from the source, s, to the sink, t, while obeying all capacity constraints
• The Max Flow-Min Cut theorem proves that the maximum flow of the network equals the capacity of the minimum cut that separates s and t

Exact vs. Approximate Flow Communities
Exact:
• The “sink” is artificial and generic, i.e., it receives an edge from every other vertex
• Accepts any bi-directional link
• The community is very connected internally, but isolated from the rest of the graph
Approximate:
• Determined by a fixed-depth crawl
• Uses the exact-flow-community algorithm, then chooses the highest-ranked sites and repeats the algorithm
• Rank is determined by the number of edges a site has within the community
• This model was used for the study, as it better represents the actual web

Score determined by total # of inbound and outbound links a page has to other pages in its community…

Sample Results
Francis Crick Community
80 Biography of Francis Harry Compton Crick (Nobel Foundation)
79 Biography of James Dewey Watson (Nobel Foundation)
51 The Nobel Prize in Physiology or Medicine 1962 (Nobel Foundation)
50 Biographical Sketch of James Dewey Watson (Cold Spring Harbor Lab.)
41 A structure for Deoxyribose Nucleic Acid (Nature, April 2, 1953)
...
1 Felix D’Herelle and the Origins of Molecular Biology (Amazon.com)
1 Biography of Gregor Mendel
1 Magazine: HMS Beagle Home
1 The Alfred Russel Wallace Page
1 U.S. Human Genome Project 5 Year Plan
Most Significant Text Features: crick, nobel, dna, “francis crick”, “the nobel”, “of dna”, watson, “james watson”, francis, molecular, biology, genetics, “watson and”, “structure of”, “crick and”

Stephen Hawking Community
85 Professor Stephen W. Hawking’s web pages
46 Stephen Hawking’s Universe at PBS
17 The Stephen Hawking Pages
15 Stephen Hawking Builds Robotic Exoskeleton (parody at the Onion)
14 Stephen Hawking and Intel
...
1 Did the cosmos arise from nothing? MSNBC story
1 Spanish page for Stephen Hawking’s Universe
1 Relativity Group at DAMTP, Cambridge
1 Millennium Mathematics Project
1 Particle Physics Education and Information Sites
Most Significant Text Features: hawking, “stephen hawking”, stephen, “hawking s”, “s universe”, physics, “black holes”, “the universe”, cambridge, cosmology, einstein, relativity, damtp, “universe the”

Ronald Rivest Community
86 Ronald L. Rivest : Home Page
29 Chaffing and Winnowing: Confidentiality without Encryption
20 Thomas H. Cormen’s home page at Dartmouth
9 The Mathematical Guts of RSA Encryption
8 German news story on Cryptography
...
1 Phil Zimmermann’s PGP web page
1 A Very Brief History of Computer Science
1 Cormen / Leiserson / Rivest: Introduction to Algorithms
1 Security and Encryption Links
1 HotBot Directory: Computers & Internet, Computer Science, People: R
Most Significant Text Features: rivest, “l rivest”, “ronald l”, ronald, cryptography, rsa, “ron rivest”, lcs, “theory lcs”, encryption, “lcs mit”, theory, chaffing, winnowing, crypto

Results, cont’d
• Communities are strongly topically related, as measured by binary classifiers
• The study used three-term binary classifiers to determine communities:
  • crick or nobel or darwin (54% match for the Francis Crick community, but only 0.5% for random web pages)
  • hawking or relativity or “for mathematical” (84% match for the Stephen Hawking community, 0.2% for random pages)
• Breadth-first crawling strategies do not yield topically relevant pages (only 10% of pages at a depth of two matched the classification rules)

What are Nepotistic Links?
• Nepotistic Links: Links between pages that are present for reasons other than merit
• Sites that are run under the same administrative control, like About.com
• Advertising/paid links
• Note: different from duplicate pages or mirrored sites
Eba6.com   Mapquesy.com

Preliminary Experiments
Two data sets were used:
1. 1536 arbitrarily selected, manually labeled links
2. 750 random links from the DiscoWeb search engine’s 7 million pages, also manually labeled as either nepotistic or not
75 binary features were used:
• Identical page titles or descriptions?
• Page descriptions overlap in at least some percentage of the text?
• Identical complete host names?
• Some number of initial IP address octets identical?
• Pages share at least some percentage of outgoing links?
• Domains had the same contact email address?
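
To make the feature list above concrete, here is a minimal Python sketch of how a few such yes/no features might be computed for one link (a source page and the page it points to). It is an illustration only, not the paper's implementation: the function names, the page dictionary layout, the thresholds (3 octets, 50% overlap), and the example hosts and addresses are all assumptions.

```python
from urllib.parse import urlsplit


def shared_ip_octets(ip_a, ip_b):
    """Count how many leading dotted-quad octets two IPv4 addresses share."""
    count = 0
    for a, b in zip(ip_a.split("."), ip_b.split(".")):
        if a != b:
            break
        count += 1
    return count


def outlink_overlap(links_a, links_b):
    """Fraction of the smaller outgoing-link set that both pages share."""
    a, b = set(links_a), set(links_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def link_features(src, dst, ip_octets=3, overlap_threshold=0.5):
    """Binary features for the link src -> dst (thresholds are assumed)."""
    return {
        "same_title": src["title"] == dst["title"],
        "same_host": urlsplit(src["url"]).hostname == urlsplit(dst["url"]).hostname,
        "ip_prefix_match": shared_ip_octets(src["ip"], dst["ip"]) >= ip_octets,
        "shared_outlinks": outlink_overlap(src["outlinks"], dst["outlinks"])
        >= overlap_threshold,
    }


# Hypothetical pair of pages on sibling hosts of one organization.
src = {
    "url": "http://www.example.com/index.html",
    "title": "Example Home",
    "ip": "192.0.2.10",
    "outlinks": ["http://a.example/", "http://b.example/", "http://c.example/"],
}
dst = {
    "url": "http://shop.example.com/",
    "title": "Example Shop",
    "ip": "192.0.2.77",
    "outlinks": ["http://a.example/", "http://b.example/", "http://d.example/"],
}
features = link_features(src, dst)
print(features)
```

Each link then becomes one row of true/false values, which is the kind of input a decision tree learner expects.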
Machine Learning
• C4.5 decision tree package used to learn classification rules from the binary features

Results

Results, cont’d
• Can classify links with more accuracy if one uses already categorized search engine results as “training data”
• Second set of data too small: it does not represent the variety of sites on the web
• Nepotistic links largely do not affect popular pages

Conclusions
• Both experiments focused on binary classifiers
• Naïve researchers: the scale of the web is too large to run any of these algorithms on it, and both used small sample sizes to begin with
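
Backup: for a sense of what the flow-based community algorithm from the earlier slides involves at toy scale, the following Python sketch finds a community as the source side of a minimum cut. It is a hedged illustration, not the authors' code. Assumptions not stated in the slides: the function name, the virtual node names, infinite capacity from the virtual source to each seed page, page-to-page capacity equal to the number of seeds, and unit capacity from every non-seed page to the virtual sink. Edmonds-Karp (BFS augmenting paths) is used only because it is short.

```python
from collections import defaultdict, deque


def max_flow_community(edges, seeds, edge_capacity=None):
    """Return the pages on the source side of a minimum source-sink cut."""
    source, sink = "_source_", "_sink_"  # virtual nodes (assumed names)
    if edge_capacity is None:
        edge_capacity = len(seeds)  # assumed choice of capacity
    cap = defaultdict(dict)  # cap[u][v] = residual capacity of u -> v
    pages = set()
    for u, v in edges:
        pages.update((u, v))
        cap[u][v] = edge_capacity  # treat every link as bi-directional
        cap[v][u] = edge_capacity
    for s in seeds:
        cap[source][s] = float("inf")
    for p in pages - set(seeds):
        cap[p][sink] = 1

    def residual_reach():
        """BFS over edges with spare capacity; returns a parent map."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = residual_reach()
        if sink not in parent:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] = cap[v].get(u, 0) + bottleneck
            v = u
    # Pages still reachable from the source lie inside the minimum cut.
    return set(parent) - {source}


# Toy graph: two fully linked groups of four pages joined by one link.
group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]
edges = [(g[i], g[j]) for g in (group_a, group_b)
         for i in range(4) for j in range(i + 1, 4)]
edges.append(("a4", "b1"))
community = max_flow_community(edges, seeds=["a1", "a2"])
print(sorted(community))
```

On this toy graph the cheapest cut severs the single bridging link, so the community is the group containing the seeds. The web-scale concern raised in the conclusions still applies: this sketch holds the whole graph in memory.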