Link Farms 模型 - 北京大学网络与信息系统研究所

Link Analysis & Spam Detection http://net.pku.edu.cn/~wbia 黄连恩 hle@net.pku.edu.cn 北京大学信息工程学院 09/24/2013 PageRank Why and how it works? “Random Walker”模型    设想有一个永不休止、在网上浏览网页的人,随机选择一个链出的链接继续访问。我们问，在稳态情况下（足够长时间后），他会正在看哪一篇网页呢？等价于：稳态情况下，每个网页v会有一个被访问的概率，p(v)，它可以作为网页的重要程度的度量。我们可以合理地设想：此时到达v的概率，依赖于上一个时刻到达“链向”v的网页的概率，以及那些网页中超链的个数。 Google Matrix  让这浏览者每次以一定的概率（1-β）沿着超链走，以概率（β）重新随机选择一个新的起始节点  这在物理意义上即总是有可能跳进入度为0的点，跳出那些“圈”。在模型表达上即为    T p  (1   ) L p  1N  p   (1   ) L  (1N )  p N N   T  β选在0.1和0.2之间，被称作damping factor(Page & Brin 1997） G=(1-β)LT+ β/N(1N) 被称为Google Matrix Google Matrix特征向量求解     Power Iteration方法：给定Google Matrix G，记|λ1| ≥|λ2| ≥…,q1 是属于λ1的特征向量初始化向量p0，使得||p0||1=1 对于k = 1, 2, …,执行如下步骤    x = Gpk-1， pk = x/||x||1，基本迭代规格化步骤可以证明（收敛速度）  |pk – q1| = O(|λ2/λ1|k)    T pi 1   (1   ) L  (1N )  pi N   例子（power iteration） 1 / 11      1/ 2   L         1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11 1 / 11  1   1  1/ 2   1/ 3 1/ 3 1/ 3   1/ 2 1/ 2  1/ 2 1/ 2   1/ 2 1/ 2  1/ 2 1/ 2   1  1  小规模数据求解     β取0.15 G= 0.85*LT+0.15/11(1N) P0=(1/11,1/11,….)T P1=GP0 You can try this in MatLab  ...  。。。。。。。    Power Iteration求解得(迭代50次) P=(0.033,0.384,0.343,0.039,0.081, 0.039,0.016……)T Applications of PageRank Algorithm Topic-Sensitive PageRank[1] ODP-Biasing Sqd   P(c j q')  rankjd j    T pi 1   (1   ) L  (1N )  pi N   Results Precision @ 10 results for test queries Ranking preferred by majority of users Pagerank for product image search[2] Results  Similarity graph generated from the top1000 search results of “Mona-Lisa.” The largest two images contain the highest rank. Results References   [1] H. H. Taher, "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search," IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 784-796, 2003. [2] Y. Jing and S. Baluja, "Pagerank for product image search," in Proceeding of the 17th international conference on World Wide Web, Beijing, China, 2008, pp. 307-316. HITS(Hyperlink Induced Topic Search)      从设计思想来看，HITS和PageRank的一个基本区别是HITS针对具体查询、应用在查询时间，而 PageRank是独立于查询的 Root set, R(q): 和查询q相关的网页集合 Base set, V(q): 除了R(q)外，还包括指向R(q)元素和被R(q)元素指向的网页 Expanded set = V - R 两个概念（直觉上有意义）   AUTHORITY（权威型网页）：内容权威，质量高的网页 HUB（目录型网页）：指向许多authority网页的网页 Authority and Hub scores   针对u∈V(q)，在每个网页u上定义有两个参数： a[u]和h[u]，分别表示其权威性和目录性。交叉定义   一个网页u的a值依赖于指向它的网页v的h值一个网页u的h值依赖于它所指的网页v的a值 a  E h T h  E a  E E h T HITS (contd.)    声望高的（入度大） 权威性高认识许多声望高的（出度大）目录性强如何计算？ Power Iteration on: a  E h  E Ea T T h  Ea  EE h T HITS算法过程（Topic Distillation） 1. 2. 3. 4. Send query to a textbased IR system and obtain the root-set. Expand the root-set by radius one to obtain an expanded graph. Run power iterations on the hub and authority scores together. Report top-ranking authorities and hubs. PageRank & HITS    网页的相互链接特性，使得我们可以应用社会网络分析的方法来从网页集合的结构中提炼有用的信息提炼什么信息？取决于我们对应用目标的认识，还取决于有关技术模型的精确定义、有效计算，以及对可能产生误差的认识和评估 PageRank和HITS是两个经典的例子，大目标一致，切入点不同。   它们不应该意味着不可能有更好的（或者更有特色的）新角度它们的一个本质缺陷是：将网页之间的链接关系“太当真”。思考题  1. PageRank 的必要性   假如没有 PageRank 会出现什么问题？ 2. PageRank VS. AnchorText  为什么你能准确查找到你想要的网页？ Web Spam 检测 Web Spam 是当前搜索引擎面临的一个最主要挑战什么是 web spam?    Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, incommensurate with page’s real value Spam = web pages that are the result of spamming This is a very broad defintion    SEO industry might disagree! SEO = search engine optimization Approximately 10-15% of web pages are spam Web Spam 的分类  Term spamming   Manipulating the text of web pages in order to appear relevant to queries Link spamming  Creating link structures that boost page rank or hubs and authorities scores Term Spamming  Repetition    Dumping    of a large number of unrelated terms e.g., copy entire dictionaries Weaving   of one or a few specific terms e.g., free, cheap, viagra Goal is to subvert TF.IDF ranking schemes Copy legitimate pages and insert spam terms at random positions Phrase Stitching  Glue together sentences and phrases from different sources Link Spamming  Three kinds of web pages from a spammer’s point of view    Inaccessible pages Accessible pages  e.g., web log comments pages  spammer can post links to his pages Own pages  Completely controlled by spammer  May span multiple domain names Link Farms  Spammer’s goal   Maximize the page rank of target page t Technique      Get as many links from accessible pages as possible to target page t Construct “link farm” to get page rank multiplier effect 在论坛、Blog 等发表评论之类指向自己的网站交换链接 honey pot Link Farms 模型 Accessible Own 1 Inaccessible t 2 M One of the most common and effective organizations for a link farm Link Farms 模型分析 Own Accessible Inaccessibl e t 1 2 M Suppose rank contributed by accessible pages = x Let page rank of target page = y Rank of each “farm” page = (1-)y/M + /N y = x + (1-)M[(1- )y/M + /N] + /N Very small; ignore = x + (1-)2y + (1-)M/N + /N y = x/(2-2) + cM/N where c = (1-)/(2-) Link Farms 模型分析 Own Accessible Inaccessibl e t 1 2 M   y = x/(2-2) + cM/N where c = (1-/(2-) For  = 0.15, 1/(2-2)= 3.6, c = 0.46  Multiplier effect for “acquired” page rank  By making M large, we can make y as large as we want Link Farms 的后果  造成大量无效信息     造成大量的虚假信息   对信息分析产生不利影响对搜集系统产生严重影响    产生大量网页产生大量链接产生大量域名产生搜集陷阱占用大量资源对 PageRank 产生严重影响！检测 Web Spam 的方法  Term spamming     Analyze text using statistical methods e.g., Naïve Bayes classifiers Similar to email spam filtering Also useful: detecting approximate duplicate pages Link spamming   基于统计的检测方法链接分析的检测方法: TrustRank，BadRank 基于统计的 Web Spam 检测   Nature seems to create bell curves (range around an average) Human activity seems to create power laws (popularity skewing) Number of Web page in-links (Broder+) More Examples frequency of words protein-interaction degree distribution Internet (AS) degree distribution severity of inter-state wars severity of terrorist attacks frequency of bird sightings size of blackouts book sales population of US cities size of religions number of citations papers authored popularity of surnames number of web hits number of web links, with cut-off number of phone calls size of email address book number of species per genus 检测方法   其基本依据是：Spam 网页往往是机器自动生成的，它们与自然生成的网页是存在区别的，那么在统计特征上往往表现为异常点，求解其交集得到的网页就很可能是 Spam 网页统计点：    URL 特征、域名解析出入度、网页内容演化过程、集群特征 Web page out-degrees There are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Web page in-degrees There are 369,457 pages have the in-degree of 1001, while according to the trend only 2,000 such pages are expected Spammers are studious! 基于链接分析的 Web Spam 检测 41 TrustRank 思想  Basic principle: approximate isolation    It is rare for a “good” page to point to a “bad” (spam) page Sample a set of “seed pages” from the web Have an oracle (human) identify the good pages and the spam pages in the seed set  Expensive task, so must make seed set as small as possible 生成种子集 (seed set)  The notion of human checking of a web page is represented by Oracle function: 0 if p is bad,  O( p)   1 if p is good.     Oracle invocations are expensive, one should strive to minimize them To evaluate the pages without calling O, it is necessary to estimate the probability that p is good The Trust function yields a range of values between 0 (bad) and 1 (good) Ideally, T ( p)  Pr[O( p)  1]. 信任传播 (Trust propagation)  Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are denoted as good 1 2 3 4 good page 5 7 6 bad page  S = {1, 3, 6} set of seed pages M = 1..3 maximum length path 1 2 3 5 6 4 7 Rules for trust propagation  Trust attenuation   The degree of trust conferred by a trusted page decreases with distance Trust splitting   The larger the number of outlinks from a page, the less scrutiny the page author gives each outlink Trust is “split” across outlinks Trust attenuation    We cannot be absolutely sure that pages reachable from good seeds are indeed good Further away we are from good seed, less certain we are that a page is good Trust dampening β – dampening factor   β Trust splitting Can be combined β 1 β 2 1/2 t(1)=1 t(2)=1 1 2 5/12 1/2 1/3 1/3 1/3 β2 β2 3 3 5/12 t(3)=5/6 Simple model  Suppose trust of page p is t(p)   For each q in O(p), p confers the trust   Trust of p is the sum of the trust conferred on p by all its inlinked pages Note similarity to Topic-Specific Page Rank   t(p)/|O(p)| for 0<<1 Trust is additive   Set of outlinks O(p) Within a scaling factor, trust rank = biased page rank with trusted pages as teleport set t=  · LT · t + (1- · d / |d| TrustRank in Action        Select seed set using inversed PageRank s=[2, 4, 5, 1, 3, 6, 7] 0 Invoke L(=3) oracle functions 1 Populate static score distribution vector d=[0, 1, 0, 1, 0, 0, 0] Normalize distribution vector 0.15 d=[0, 1/2, 0, 1/2, 0, 0, 0] 4 Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting 0.05 RESULTS [0, 0.18, 0.12, 0.15, 0.13, 7 0.05, 0.05] 0.12 2 0.18 3 0.13 5 6 0.05 选择最优种子集 (seed set)  Two conflicting considerations   Human has to inspect each seed page, so seed set must be as small as possible Must ensure every “good page” gets adequate trust rank, so need make all good pages reachable from seed set by short paths Approaches to picking seed set   Suppose we want to pick a seed set of k pages PageRank    Pick the top k pages by page rank Assume high page rank pages are close to other highly ranked pages We care more about high page rank “good” pages Inverse page rank   Pick the pages with the maximum number of outlinks Can make it recursive   Formalize as “inverse page rank”    Pick pages that link to pages with many outlinks Construct graph G’ by reversing each edge in web graph G Page Rank in G’ is inverse page rank in G Pick top k pages by inverse page rank 思考题：TrustRank VS. PageRank     是否可以使用 TrustRank 代替 PageRank？优点是什么？缺点是什么？假如 Google 一开始就采用 TrustRank，还会不会出现 Link Farms 问题？思考题：TrustRank + PageRank  一种有效的 Link Farms 检测方法        首先计算一个正常的 PageRank 值接下来计算一个 TrustRank 值然后将 PageRank 减去 TrustRank 最后发现数值大的网页就是 Link Farms 的目标网页为什么？参考论文 “Link spam detection based on mass estimation” 请思考：怎么结合 TrustRank 和 PageRank 获得一个更合理的 Ranking？本次课小结  PageRank 的应用    HITS 算法   Topic sensitive：偏向性的 Ranking Image Ranking：将选择问题转化为排序问题 AUTHORITY 和 HUB Web Spam 检测    Link Farms 模型基于统计的检测方法 TrustRank Thank You! Q&A HOME WORK  请论证 Topic sensitive 与 TrustRank 是否等价？  References：   [1] Z. Gyongyi, H. Garcia-Molina and J.Pedersen. Combating Web Spam with TrustRank. Tech. rep., Stanford University, 2004. [2] Z. Gyongyi and H. Garcia-Molina. Seed selection in TrustRank. Tech. rep., Stanford University, 2004.

Link Farms 模型 - 北京大学网络与信息系统研究所

Related documents

Products

Support

Link Farms 模型 - 北京大学网络与信息系统研究所

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib