Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings
Baoning Wu and Brian D. Davison
Lehigh University
Symposium on Applied Computing 2006

Motivation
Link-based ranking algorithms are important to current popular search engines (e.g., HITS for Teoma).
Link farms degrade the performance of link-based ranking algorithms.

HITS algorithm
Each page has two measures: the authority score a shows how good the page is for a query; the hub score h shows how likely the page is to point to good authority pages.
E is the adjacency matrix.
a = E^T h
h = E a

Example: for query “weather”
http://www.tripadvisor.com/
http://www.virtualtourist.com/
http://www.abed.com/memoryfoam.html
http://www.abed.com/furniture.html
http://www.rental-car.us/
http://www.accommodation-specials.com/
http://www.lasikeyesurgery.com/
http://www.lasikeyesurgery.com/lasik-surgery.asp
http://mortgage-rate-refinancing.com/
http://mortgage-rate-refinancing.com/mortgagecalculator.html

Factors that degrade HITS
Mutually reinforcing relationships
Duplicate pages
Link farms

Complete hyperlink
Definition: a link together with its anchor text, treated as a unit.
Duplication of a complete link is a much stronger sign of copying behavior on the Web than a duplicate link target.

Document - Complete link Matrix

Bipartite Graph
Two disjoint sets X and Y; each edge starts from an element in X and ends at an element in Y.

Link farms
Link farms are usually densely connected via multiple overlapping small bipartite cores.
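
The “complete link” definition above (link target plus anchor text as one unit) and the document - complete link relation can be illustrated with a minimal sketch. It assumes each page is available as a list of (target URL, anchor text) pairs. The function names, the `min_docs` parameter, the sample pages, and the strip/lowercase normalization of anchor text are all illustrative assumptions, not details taken from the slides.

```python
from collections import defaultdict


def complete_link_index(pages):
    """Map each complete link (target URL, anchor text) to the set of documents containing it.

    `pages` maps a document identifier to a list of (target, anchor) pairs.
    Anchor text is stripped and lowercased; this normalization is an assumption.
    """
    index = defaultdict(set)
    for doc, links in pages.items():
        for target, anchor in links:
            index[(target, anchor.strip().lower())].add(doc)
    return index


def duplicated_complete_links(index, min_docs=2):
    """Keep only complete links that appear in at least `min_docs` documents (hypothetical threshold)."""
    return {link: docs for link, docs in index.items() if len(docs) >= min_docs}


# Hypothetical pages: a.example and b.example copy the same two complete links;
# c.example links to the same target but with different anchor text.
pages = {
    "a.example": [("http://t.example/", "cheap cars"), ("http://u.example/", "hotels")],
    "b.example": [("http://t.example/", "Cheap Cars"), ("http://u.example/", "hotels")],
    "c.example": [("http://t.example/", "car rental reviews")],
}

index = complete_link_index(pages)
shared = duplicated_complete_links(index)
for link, docs in sorted(shared.items()):
    print(link, sorted(docs))
```

In this toy input, c.example shares a link target with the other two pages but not a complete link, so it is not grouped with them. The two documents and the two complete links they share form a small bipartite core in the sense of the Bipartite Graph slide.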
Task: to detect densely connected bipartite components from “document - complete link” matrix

Algorithm for finding bipartite components

Result: k=2 and l=2

Adjustment: document-document matrix

Final matrix

Weighted adjacency matrix

Experiment: HITS result of “rental car”
http://www.discountcars.net/
http://www.motel-discounts.com/
http://www.stlouishoteldeals.com/
http://www.richmondhoteldeals.com/
http://www.jacksonvillehoteldeals.com/
http://www.jacksonhoteldeals.com/
http://www.keywesthoteldeals.com/
http://www.austinhoteldeals.com/
http://www.gatlinburghoteldeals.com/
http://www.ashevillehoteldeals.com/

Experiment: B&H HITS result of “rental car”
http://www.rentadeal.com/
http://www.allaboutstlouis.com/
http://www.allaboutboston.com/
https://travel2.securesites.com/about_travelguides/addlisting.html
http://www.allaboutsanfranciscoca.com/
http://www.allaboutwashingtondc.com/
http://www.allaboutalbuquerque.com/
http://www.allabout-losangeles.com/
http://www.allabout-denver.com/
http://www.allabout-chicago.com/

Experiment: CL-HITS result of “rental car”
http://www.hertz.com/
http://www.avis.com/
http://www.nationalcar.com/
http://www.thrifty.com/
http://www.dollar.com/
http://www.alamo.com/
http://www.budget.com/
http://www.enterprise.com/
http://www.budgetrentacar.com/
http://www.europcar.com/

Experiment: B&H HITS result of “translation online”
http://www.no-gambling.com/
http://www.teleorg.org/
http://ong.altervista.org/
http://bx.b0x.com/
http://video-poker.batcave.net/
http://www.websamba.com/marketing-campaigns
http://online-casino.o-f.com/
http://caribbean-poker.webxis.com/
http://roulette.zomi.net/
http://teleservices.netfirms.com/

Experiment: CL-HITS result of “translation online”
http://www.freetranslation.com/
http://www.systransoft.com/
http://babelfish.altavista.com/
http://www.yourdictionary.com/
http://dictionaries.travlang.com/
http://www.google.com/
http://www.foreignword.com/
http://www.babylon.com/
http://www.worldlingo.com/products_services/worldlingo_translator.html
http://www.allwords.com/

Duplicate example: BH-HITS result of “maps”
http://www.maps.com/
http://www.mapsworldwide.com/
http://www.cartographic.com/
http://www.amaps.com/
http://www.cdmaps.com/
http://www.ewpnet.com/maps.htm
http://mapsguidesandmore.com/
http://www.njdiningguide.com/maps.html
http://www.stanfords.co.uk/
http://www.delorme.com/

Duplicate example: CL-HITS result of “maps”
http://www.maps.com/
http://maps.yahoo.com/
http://www.delorme.com/
http://tiger.census.gov/
http://www.davidrumsey.com/
http://memory.loc.gov/ammem/gmdhtml/gmdhome.html
http://www.esri.com/
http://www.maptech.com/
http://www.streetmap.co.uk/
http://www.libs.uga.edu/darchive/hargrett/maps/maps.html

User evaluation
Category             HITS     BHITS    CL-HITS  CL-POP
Quite relevant       12.9%    24.5%    48.4%    46.3%
Relevant             10.7%    18.3%    28.8%    26.2%
Not sure              6.6%    10.5%     6.7%     6.4%
Irrelevant           26.8%    14.8%    11.3%    12.7%
Totally irrelevant   42.8%    31.9%     4.6%     8.1%

Discussion
Using link alone, the precision at 10 is 66.4%. Much lower than using “complete link”.
Random anchor texts.

Questions?
baw4@cse.lehigh.edu
davison@cse.lehigh.edu