CS246 Link-Based Ranking

Problems of TFIDF Vector
- Works well on a small controlled corpus, but not on the Web
- Easy to spam
- Top result for the “American Airlines” query: accident report of American Airlines flights
- Do users really care how many times “American Airlines” is mentioned?
- Ranking purely based on page content
- Authors can manipulate page content to get a high ranking
- Any idea?

Link-based Ranking
- People “expect” to get the AA home page for the query “American Airlines”
- Many pages point to the AA home page, but not to the accident report
- Use link-count!

Simple Link Count
- Still easy to spam
- Create many pages and add links to a page
- How to avoid spam?

PageRank
- A page is important if it is pointed to by many important pages
- $PR(p) = PR(p_1)/n_1 + \dots + PR(p_k)/n_k$
  - $p_i$: page pointing to $p$, $n_i$: number of links in $p_i$
- PageRank of $p$ is the sum of the shares $PR(p_i)/n_i$ it receives from its parents
- One equation for every page
- N equations, N unknown variables

Example: Web of 1842
- Netscape, Microsoft and Amazon
- [Figure: link graph over Ne (Netscape), MS (Microsoft), Am (Amazon)]

\[
\begin{aligned}
PR(n) &= PR(n)/2 + PR(a)/2 \\
PR(m) &= PR(a)/2 \\
PR(a) &= PR(n)/2 + PR(m)
\end{aligned}
\]

\[
\begin{bmatrix} n \\ m \\ a \end{bmatrix} =
\begin{bmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{bmatrix}
\begin{bmatrix} n \\ m \\ a \end{bmatrix}
\]

PageRank: Matrix Notation
- Web graph matrix $M = \{ m_{ij} \}$
  - Each page $i$ corresponds to row $i$ and column $i$ of the matrix $M$
  - $m_{ij} = 1/n$ if page $i$ is one of the $n$ children of page $j$
  - $m_{ij} = 0$ otherwise
- PageRank vector $\vec{p} = (p_1, p_2, p_3, \ldots)^T$
- PageRank equation $\vec{p} = M \vec{p}$

PageRank: Iterative Computation
- Initially every page has a unit of importance
- At each round, each page shares its importance among its children and receives new importance from its parents
- Eventually the importance of each page reaches a limit
- Stochastic matrix (each column of $M$ sums to 1)

Example: Web of 1842

\[
\begin{bmatrix} n \\ m \\ a \end{bmatrix} =
\begin{bmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{bmatrix}
\begin{bmatrix} n \\ m \\ a \end{bmatrix}
\]

Round   0     1     2      3       4       ...   limit
n       1/3   1/3   5/12   3/8     5/12    ...   2/5
m       1/3   1/6   1/4    1/6     11/48   ...   1/5
a       1/3   1/2   1/3    11/24   17/48   ...   2/5

PageRank: Eigenvector
- PageRank equation $\vec{p} = M \vec{p}$
- $\vec{p}$ is the principal eigenvector of $M$

PageRank: Random Surfer Model
- PageRank is the probability that a Web surfer reaches a page after many clicks, following random links
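The iterative computation on the three-page example (Netscape, Microsoft, Amazon) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the lecture: the names `step` and `iterate` and the choice of 60 rounds are assumptions made for this example, and exact fractions are used so the intermediate vectors can be compared with the values on the slide.

```python
from fractions import Fraction

# Link matrix M for the three-page example (order: Netscape, Microsoft, Amazon).
# M[i][j] = 1/n if page i is one of the n children of page j, else 0.
HALF = Fraction(1, 2)
M = [
    [HALF, 0, HALF],
    [0, 0, HALF],
    [HALF, 1, 0],
]

def step(matrix, p):
    """One round of importance sharing: returns the product M p."""
    return [sum(row[j] * p[j] for j in range(len(p))) for row in matrix]

def iterate(matrix, p, rounds):
    """Apply `step` a fixed number of times (hypothetical helper)."""
    for _ in range(rounds):
        p = step(matrix, p)
    return p

# Start from equal importance, as in the slide example.
start = [Fraction(1, 3)] * 3

after_one = iterate(M, start, 1)    # (1/3, 1/6, 1/2)
after_four = iterate(M, start, 4)   # (5/12, 11/48, 17/48)

# 60 rounds is an arbitrary cut-off chosen for this sketch; the vector
# is then already very close to the limit (2/5, 1/5, 2/5).
approx_limit = [float(x) for x in iterate(M, start, 60)]

print(after_one)
print(after_four)
print(approx_limit)
```

Because the columns of M sum to 1, the total importance stays constant from round to round, which is why the limit is still a distribution over the three pages.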
Problems on the Real Web
- Dead end
  - A page with no links to send importance
  - All importance “leaks out of” the Web
- Crawler trap
  - A group of one or more pages that have no links out of the group
  - Accumulates all the importance of the Web

Example: Dead End
- No link from Microsoft (dead end)
- [Figure: link graph over Ne, MS, Am]

\[
\begin{bmatrix} n \\ m \\ a \end{bmatrix} =
\begin{bmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 0 & 0 \end{bmatrix}
\begin{bmatrix} n \\ m \\ a \end{bmatrix}
\]

Round   0     1     2      3      4      ...   limit
n       1/3   1/3   1/4    5/24   1/6    ...   0
m       1/3   1/6   1/12   1/12   1/16   ...   0
a       1/3   1/6   1/6    1/8    5/48   ...   0

Solution to Dead End
- Assume a surfer jumps to a random page at a dead end

\[
\begin{bmatrix} n \\ m \\ a \end{bmatrix} =
\begin{bmatrix} 1/2 & 1/3 & 1/2 \\ 0 & 1/3 & 1/2 \\ 1/2 & 1/3 & 0 \end{bmatrix}
\begin{bmatrix} n \\ m \\ a \end{bmatrix}
\]

Example: Crawler Trap
- Only self-link at Microsoft (crawler trap)
- [Figure: link graph over Ne, MS, Am]

\[
\begin{bmatrix} n \\ m \\ a \end{bmatrix} =
\begin{bmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{bmatrix}
\begin{bmatrix} n \\ m \\ a \end{bmatrix}
\]

Round   0     1     2      3      4       ...   limit
n       1/3   1/3   1/4    5/24   1/6     ...   0
m       1/3   1/2   7/12   2/3    35/48   ...   1
a       1/3   1/6   1/6    1/8    5/48    ...   0

Crawler Trap: Damping Factor
- “Tax” each page some fraction of its importance and distribute it equally
  - Probability to jump to a random page
- Assuming 20% tax:

\[
\begin{bmatrix} n \\ m \\ a \end{bmatrix} =
0.8 \begin{bmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{bmatrix}
\begin{bmatrix} n \\ m \\ a \end{bmatrix}
+ 0.2 \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}
\]

\[
PR(p_i) = d \sum_j PR(p_j)/c_j + (1-d)/N
\]

- The sum runs over the pages $p_j$ that link to $p_i$; $c_j$ is the number of links in $p_j$
- Solution: $n = 7/33$, $m = 7/11$, $a = 5/33$

Link Spam Problem
- Q: What if a spammer creates a lot of pages and links them all to a single spam page?
- PageRank is better than simple link count, but still vulnerable to link spam
- Q: Any way to avoid link spam?

TrustRank [Gyongyi et al. 2004]
- Good pages don’t point to spam pages
- Trust a page only if it is linked to by what you trust

\[
TR(p_i) = d \sum_j TR(p_j)/c_j + b_i, \qquad
b_i = \begin{cases} (1-d)/N_T & \text{if } p_i \text{ is trusted} \\ 0 & \text{otherwise} \end{cases}
\]

- Same as PageRank except for the random jump probability term

TrustRank: Theory [Bianchini et al. 2005]
- Given $P(p_i) = d \sum_j P(p_j)/c_j + b_i$, consider a set of pages $S$
- [Figure: the set S together with IN(S), OUT(S) and DP(S)]

\[
P_S = \sum_{p \in S} P(p), \qquad B_S = \sum_{p_i \in S} b_i
\]

\[
P_{IN} = \sum_{p_i \in IN(S)} P(p_i)/c_i, \qquad
P_{OUT} = \sum_{p_i \in OUT(S)} P(p_i)/c_i, \qquad
P_{DP} = \sum_{p_i \in DP(S)} P(p_i)
\]

Summing the equation over all pages in $S$ and solving for $P_S$:

\[
P_S = \frac{1}{1-d} B_S + \frac{d}{1-d} P_{IN} - \frac{d}{1-d} P_{OUT} - \frac{d}{1-d} P_{DP}
\]

What Does It Mean?
- Dropping the constant coefficients: $P_S \sim B_S + P_{IN} - P_{OUT} - P_{DP}$
- $P_S = 0$ if $B_S = 0$ and $P_{IN} = 0$
  - You cannot improve your TrustRank simply by creating more pages and linking among them
- To get non-zero TrustRank, you need to be either trusted or get links from outside

Is TrustRank the Ultimate Solution?
- Not really…
- Honeypot: a page with good content and hidden links to spam pages
  - Good users link to the honeypot due to its quality content
- Blogs, forums, wikis, mailing lists
  - Easy to add spam links
- Link exchange
  - Set of sites exchanging links to boost ranking
- A never-ending rat race…

Anti-Spamming at Search Engines
- Anchor text
  - Consider what others think about your page
  - Give higher weights to anchors from high-PageRank pages
  - More difficult to spam
- TrustRank
  - To gain importance, you need to convince many pages under others’ control or convince search engines
  - More difficult to spam
- Consider inter-site links with higher weight

Hub and Authority
- More detailed evaluation of importance
- A page is useful if
  - It has good contents, or
  - It has links to useful pages (good bookmark)
- Hub/Authority
  - Authority: pages with good contents
  - Hub: pages pointing to good content pages

Hub/Authority: Definition
- Recursive definition similar to PageRank
  - Authority pages are linked to by many hub pages
  - Hub pages link to many authority pages
- $H(p) = A(p_1) + \dots + A(p_k)$, where $p_1, \dots, p_k$ are the pages that $p$ points to
- $A(p) = H(p_1) + \dots + H(p_m)$, where $p_1, \dots, p_m$ are the pages pointing to $p$

Hub/Authority: Matrix Notation
- Web graph matrix $A = \{ a_{ij} \}$
  - Each page $i$ corresponds to row $i$ and column $i$ of the matrix $A$
  - $a_{ij} = 1$ if page $i$ points to page $j$
  - $a_{ij} = 0$ otherwise
- $A$ is not a stochastic matrix
- $A^T$: similar to the PageRank matrix $M$, without the stochastic restriction

Example: Web of 1842
- [n, m, a]: vector
- [Figure: link graph over Ne, MS, Am]

\[
A = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \qquad
A^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\]

Hub/Authority: Iterative Computation
- Hub/Authority vectors

\[
\vec{h} = \begin{bmatrix} h_n \\ h_m \\ h_a \end{bmatrix}, \qquad
\vec{a} = \begin{bmatrix} a_n \\ a_m \\ a_a \end{bmatrix}
\]

\[
\vec{a} = \mu A^T \vec{h}, \qquad \vec{h} = \lambda A \vec{a}
\]

- $\mu$, $\lambda$: scaling factors that keep the values from diverging
- Compute $\vec{a}$ and $\vec{h}$ iteratively with scaling

Hub/Authority: Eigenvector

\[
\vec{a} = \mu A^T \vec{h}, \qquad \vec{h} = \lambda A \vec{a}
\]

\[
\vec{a} = \mu A^T \vec{h} = \lambda \mu A^T A \vec{a}, \qquad
\vec{h} = \lambda A \vec{a} = \lambda \mu A A^T \vec{h}
\]

- $\vec{a}$: eigenvector of $A^T A$
- $\vec{h}$: eigenvector of $A A^T$

Example: Web of 1842
- [Figure: link graph over Ne, MS, Am]

\[
A^T A = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}, \qquad
A A^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix}
\]

Round   0   1   2    3     ...   limit (up to scaling)
a_n     1   5   24   114   ...   1 + \sqrt{3}
a_m     1   5   24   114   ...   1 + \sqrt{3}
a_a     1   4   18   84    ...   2

Round   0   1   2    3     ...   limit (up to scaling)
h_n     1   6   28   132   ...   2 + \sqrt{3}
h_m     1   2   8    36    ...   1
h_a     1   4   20   96    ...   1 + \sqrt{3}

Hub/Authority and Root Set
- Apply the equations on a small neighbor graph (base set)
  - Start with, say, 100 pages on “bicycling”
  - Add pages pointing to the 100 pages
  - Add pages that the 100 pages are pointing to
- Identified pages are good “Hubs” and “Authorities” on “bicycling”

Hub/Authority and Web Community
- Hub/Authority is often used to identify Web communities
  - Nice notion of “Hub” and “Authority” of the community
  - Often Hubs and Authorities are tightly linked to each other

Any Questions?

Questions
- Can we apply Hub/Authority to the entire Web like PageRank?

Hub/Authority on the Entire Web?
- Hub/Authority works well on a topic-specific subset, but works poorly for the whole Web
- Easy to spam
  1. Create a page pointing to many authority pages (e.g., Yahoo, Google, etc.)
     - The page becomes a good hub page
  2. On the page, add a link to your home page

Questions
- Can we apply PageRank to a small base set?

PageRank on a Small Subset
- In general, PageRank works better for a larger dataset
- We may be able to compute “topic-specific” PageRank
- Any other way for “topic-specific” PageRank?

Summary: Link-Based Ranking
- PageRank
  - TrustRank variation
- Hub/Authority
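As a companion to the Hub/Authority slides, the iterative computation with scaling can be sketched in Python on the three-page example. This is a minimal sketch under stated assumptions, not the lecture's own code: the helper names (`transpose`, `matvec`, `scale`, `hits`), the choice of dividing by the largest entry as the scaling factor, and the fixed 50 rounds are all choices made for this illustration.

```python
# Adjacency matrix A for the Hub/Authority example (order: n, m, a).
# A[i][j] = 1 if page i points to page j, else 0.
A = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]

def transpose(m):
    """Return the transpose of a square list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def scale(v):
    """Rescale so the largest entry is 1 (one possible scaling factor)."""
    top = max(v)
    return [x / top for x in v]

def hits(adj, rounds=50):
    """Alternate the authority and hub updates, rescaling after each."""
    adj_t = transpose(adj)
    h = [1.0] * len(adj)
    a = [1.0] * len(adj)
    for _ in range(rounds):
        a = scale(matvec(adj_t, h))  # authority update: a proportional to A^T h
        h = scale(matvec(adj, a))    # hub update: h proportional to A a
    return h, a

hub, auth = hits(A)
print("hub:", hub)
print("authority:", auth)
```

With this scaling, the authority vector approaches the direction (1 + sqrt(3), 1 + sqrt(3), 2) and the hub vector the direction (2 + sqrt(3), 1, 1 + sqrt(3)), each divided by its largest entry. A different normalization (for example, dividing by the sum) would change the magnitudes but not the ranking.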