Link Analysis & Spam Detection http://net.pku.edu.cn/~wbia 黄连恩 hle@net.pku.edu.cn 北京大学信息工程学院 09/24/2013 PageRank Why and how it works? “Random Walker”模型 设想有一个永不休止、在网上浏览网页的人,随机 选择一个链出的链接继续访问。我们问,在稳态情 况下(足够长时间后),他会正在看哪一篇网页呢? 等价于:稳态情况下,每个网页v会有一个被访问 的概率,p(v),它可以作为网页的重要程度的度量。 我们可以合理地设想:此时到达v的概率,依赖于 上一个时刻到达“链向”v的网页的概率,以及那 些网页中超链的个数。 Google Matrix 让这浏览者每次以一定的概率(1-β)沿着超链 走,以概率(β)重新随机选择一个新的起始节 点 这在物理意义上即总是有可能跳进入度为0的点,跳 出那些“圈”。在模型表达上即为 T p (1 ) L p 1N p (1 ) L (1N ) p N N T β选在0.1和0.2之间,被称作damping factor(Page & Brin 1997) G=(1-β)LT+ β/N(1N) 被称为Google Matrix Google Matrix特征向量求解 Power Iteration方法: 给定Google Matrix G,记|λ1| ≥|λ2| ≥…,q1 是属于λ1的特征向量 初始化向量p0,使得||p0||1=1 对于k = 1, 2, …,执行如下步骤 x = Gpk-1, pk = x/||x||1, 基本迭代 规格化步骤 可以证明(收敛速度) |pk – q1| = O(|λ2/λ1|k) T pi 1 (1 ) L (1N ) pi N 例子(power iteration) 1 / 11 1/ 2 L 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 1 1/ 2 1/ 3 1/ 3 1/ 3 1/ 2 1/ 2 1/ 2 1/ 2 1/ 2 1/ 2 1/ 2 1/ 2 1 1 小规模数据求解 β取0.15 G= 0.85*LT+0.15/11(1N) P0=(1/11,1/11,….)T P1=GP0 You can try this in MatLab ... 。。。。。。。 Power Iteration求解得(迭代50次) P=(0.033,0.384,0.343,0.039,0.081, 0.039,0.016……)T Applications of PageRank Algorithm Topic-Sensitive PageRank[1] ODP-Biasing Sqd P(c j q') rankjd j T pi 1 (1 ) L (1N ) pi N Results Precision @ 10 results for test queries Ranking preferred by majority of users Pagerank for product image search[2] Results Similarity graph generated from the top1000 search results of “Mona-Lisa.” The largest two images contain the highest rank. Results References [1] H. H. Taher, "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search," IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 784-796, 2003. [2] Y. Jing and S. Baluja, "Pagerank for product image search," in Proceeding of the 17th international conference on World Wide Web, Beijing, China, 2008, pp. 307-316. HITS(Hyperlink Induced Topic Search) 从设计思想来看,HITS和PageRank的一个基本 区别是HITS针对具体查询、应用在查询时间,而 PageRank是独立于查询的 Root set, R(q): 和查询q相关的网页集合 Base set, V(q): 除了R(q)外,还包括指向R(q)元 素和被R(q)元素指向的网页 Expanded set = V - R 两个概念(直觉上有意义) AUTHORITY(权威型网页):内容权威,质量高的网 页 HUB(目录型网页):指向许多authority网页的网页 Authority and Hub scores 针对u∈V(q),在每个网页u上定义有两个参数: a[u]和h[u],分别表示其权威性和目录性。 交叉定义 一个网页u的a值依赖于指向它的网页v的h值 一个网页u的h值依赖于它所指的网页v的a值 a E h T h E a E E h T HITS (contd.) 声望高的(入度大) 权威性高 认识许多声望高的(出度大)目录性强 如何计算? Power Iteration on: a E h E Ea T T h Ea EE h T HITS算法过程(Topic Distillation) 1. 2. 3. 4. Send query to a textbased IR system and obtain the root-set. Expand the root-set by radius one to obtain an expanded graph. Run power iterations on the hub and authority scores together. Report top-ranking authorities and hubs. PageRank & HITS 网页的相互链接特性,使得我们可以应用社会网络 分析的方法来从网页集合的结构中提炼有用的信息 提炼什么信息?取决于我们对应用目标的认识,还 取决于有关技术模型的精确定义、有效计算,以及 对可能产生误差的认识和评估 PageRank和HITS是两个经典的例子,大目标一致, 切入点不同。 它们不应该意味着不可能有更好的(或者更有特色的) 新角度 它们的一个本质缺陷是:将网页之间的链接关系“太当 真”。 思考题 1. PageRank 的必要性 假如没有 PageRank 会出现什么问题? 2. PageRank VS. AnchorText 为什么你能准确查找到你想要的网页? Web Spam 检测 Web Spam 是当前搜索引擎面临的一个最主要挑战 什么是 web spam? Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, incommensurate with page’s real value Spam = web pages that are the result of spamming This is a very broad defintion SEO industry might disagree! SEO = search engine optimization Approximately 10-15% of web pages are spam Web Spam 的分类 Term spamming Manipulating the text of web pages in order to appear relevant to queries Link spamming Creating link structures that boost page rank or hubs and authorities scores Term Spamming Repetition Dumping of a large number of unrelated terms e.g., copy entire dictionaries Weaving of one or a few specific terms e.g., free, cheap, viagra Goal is to subvert TF.IDF ranking schemes Copy legitimate pages and insert spam terms at random positions Phrase Stitching Glue together sentences and phrases from different sources Link Spamming Three kinds of web pages from a spammer’s point of view Inaccessible pages Accessible pages e.g., web log comments pages spammer can post links to his pages Own pages Completely controlled by spammer May span multiple domain names Link Farms Spammer’s goal Maximize the page rank of target page t Technique Get as many links from accessible pages as possible to target page t Construct “link farm” to get page rank multiplier effect 在论坛、Blog 等发表评论之类指向自己的网站 交换链接 honey pot Link Farms 模型 Accessible Own 1 Inaccessible t 2 M One of the most common and effective organizations for a link farm Link Farms 模型分析 Own Accessible Inaccessibl e t 1 2 M Suppose rank contributed by accessible pages = x Let page rank of target page = y Rank of each “farm” page = (1-)y/M + /N y = x + (1-)M[(1- )y/M + /N] + /N Very small; ignore = x + (1-)2y + (1-)M/N + /N y = x/(2-2) + cM/N where c = (1-)/(2-) Link Farms 模型分析 Own Accessible Inaccessibl e t 1 2 M y = x/(2-2) + cM/N where c = (1-/(2-) For = 0.15, 1/(2-2)= 3.6, c = 0.46 Multiplier effect for “acquired” page rank By making M large, we can make y as large as we want Link Farms 的后果 造成大量无效信息 造成大量的虚假信息 对信息分析产生不利影响 对搜集系统产生严重影响 产生大量网页 产生大量链接 产生大量域名 产生搜集陷阱 占用大量资源 对 PageRank 产生严重影响! 检测 Web Spam 的方法 Term spamming Analyze text using statistical methods e.g., Naïve Bayes classifiers Similar to email spam filtering Also useful: detecting approximate duplicate pages Link spamming 基于统计的检测方法 链接分析的检测方法: TrustRank,BadRank 基于统计的 Web Spam 检测 Nature seems to create bell curves (range around an average) Human activity seems to create power laws (popularity skewing) Number of Web page in-links (Broder+) More Examples frequency of words protein-interaction degree distribution Internet (AS) degree distribution severity of inter-state wars severity of terrorist attacks frequency of bird sightings size of blackouts book sales population of US cities size of religions number of citations papers authored popularity of surnames number of web hits number of web links, with cut-off number of phone calls size of email address book number of species per genus 检测方法 其基本依据是:Spam 网页往往是机器自动生成的 ,它们与自然生成的网页是存在区别的,那么在统 计特征上往往表现为异常点,求解其交集得到的网 页就很可能是 Spam 网页 统计点: URL 特征、域名解析 出入度、网页内容 演化过程、集群特征 Web page out-degrees There are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Web page in-degrees There are 369,457 pages have the in-degree of 1001, while according to the trend only 2,000 such pages are expected Spammers are studious! 基于链接分析的 Web Spam 检测 41 TrustRank 思想 Basic principle: approximate isolation It is rare for a “good” page to point to a “bad” (spam) page Sample a set of “seed pages” from the web Have an oracle (human) identify the good pages and the spam pages in the seed set Expensive task, so must make seed set as small as possible 生成种子集 (seed set) The notion of human checking of a web page is represented by Oracle function: 0 if p is bad, O( p) 1 if p is good. Oracle invocations are expensive, one should strive to minimize them To evaluate the pages without calling O, it is necessary to estimate the probability that p is good The Trust function yields a range of values between 0 (bad) and 1 (good) Ideally, T ( p) Pr[O( p) 1]. 信任传播 (Trust propagation) Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are denoted as good 1 2 3 4 good page 5 7 6 bad page S = {1, 3, 6} set of seed pages M = 1..3 maximum length path 1 2 3 5 6 4 7 Rules for trust propagation Trust attenuation The degree of trust conferred by a trusted page decreases with distance Trust splitting The larger the number of outlinks from a page, the less scrutiny the page author gives each outlink Trust is “split” across outlinks Trust attenuation We cannot be absolutely sure that pages reachable from good seeds are indeed good Further away we are from good seed, less certain we are that a page is good Trust dampening β – dampening factor β Trust splitting Can be combined β 1 β 2 1/2 t(1)=1 t(2)=1 1 2 5/12 1/2 1/3 1/3 1/3 β2 β2 3 3 5/12 t(3)=5/6 Simple model Suppose trust of page p is t(p) For each q in O(p), p confers the trust Trust of p is the sum of the trust conferred on p by all its inlinked pages Note similarity to Topic-Specific Page Rank t(p)/|O(p)| for 0<<1 Trust is additive Set of outlinks O(p) Within a scaling factor, trust rank = biased page rank with trusted pages as teleport set t= · LT · t + (1- · d / |d| TrustRank in Action Select seed set using inversed PageRank s=[2, 4, 5, 1, 3, 6, 7] 0 Invoke L(=3) oracle functions 1 Populate static score distribution vector d=[0, 1, 0, 1, 0, 0, 0] Normalize distribution vector 0.15 d=[0, 1/2, 0, 1/2, 0, 0, 0] 4 Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting 0.05 RESULTS [0, 0.18, 0.12, 0.15, 0.13, 7 0.05, 0.05] 0.12 2 0.18 3 0.13 5 6 0.05 选择最优种子集 (seed set) Two conflicting considerations Human has to inspect each seed page, so seed set must be as small as possible Must ensure every “good page” gets adequate trust rank, so need make all good pages reachable from seed set by short paths Approaches to picking seed set Suppose we want to pick a seed set of k pages PageRank Pick the top k pages by page rank Assume high page rank pages are close to other highly ranked pages We care more about high page rank “good” pages Inverse page rank Pick the pages with the maximum number of outlinks Can make it recursive Formalize as “inverse page rank” Pick pages that link to pages with many outlinks Construct graph G’ by reversing each edge in web graph G Page Rank in G’ is inverse page rank in G Pick top k pages by inverse page rank 思考题:TrustRank VS. PageRank 是否可以使用 TrustRank 代替 PageRank? 优点是什么? 缺点是什么? 假如 Google 一开始就采用 TrustRank,还会不会 出现 Link Farms 问题? 思考题:TrustRank + PageRank 一种有效的 Link Farms 检测方法 首先计算一个正常的 PageRank 值 接下来计算一个 TrustRank 值 然后将 PageRank 减去 TrustRank 最后发现数值大的网页就是 Link Farms 的目标网页 为什么? 参考论文 “Link spam detection based on mass estimation” 请思考:怎么结合 TrustRank 和 PageRank 获得一 个更合理的 Ranking? 本次课小结 PageRank 的应用 HITS 算法 Topic sensitive:偏向性 的 Ranking Image Ranking:将选 择问题转化为排序问题 AUTHORITY 和 HUB Web Spam 检测 Link Farms 模型 基于统计的检测方法 TrustRank Thank You! Q&A HOME WORK 请论证 Topic sensitive 与 TrustRank 是否等价? References: [1] Z. Gyongyi, H. Garcia-Molina and J.Pedersen. Combating Web Spam with TrustRank. Tech. rep., Stanford University, 2004. [2] Z. Gyongyi and H. Garcia-Molina. Seed selection in TrustRank. Tech. rep., Stanford University, 2004.