The ppt is taken from Xiaotie Deng.

Lecture 6 (Week 6): Web (Social) Structure Mining
• Hypertext Transfer Protocol
• Hyperlink structure of the web
• Relevance of closely linked web pages
• In-degree as a measure of popularity
• Physical structure of the Internet
• Interesting websites:
  – http://prj61/GoogleTest3/GoogleSearch.aspx
  – http://www.google.com.hk/
2016/5/29 Dr. Wang

Hyperlinks
• The Hypertext Transfer Protocol (HTTP/1.0):
  – An application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems.
  – A generic, stateless, object-oriented protocol that can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands).
  – An important feature: the typing of data representations allows systems to be built independently of the data being transferred.
  – Based on the request/response paradigm.
  – In use by the WWW global information initiative since 1990.
• http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.txt

HTML (Hypertext Markup Language)
• HTML - the Hypertext Markup Language - is the lingua franca for publishing on the World Wide Web. Having gone through several stages of evolution, today's HTML has a wide range of features reflecting the needs of a very diverse and international community wishing to make information available on the Web.
  – See: http://www.w3.org/MarkUp/

Web Structure
• Directed virtual links
• Special features: huge, unknown, dynamic
• Directed graph functions: backlink, shortest path, cliques

An Unknown Digraph
• The hyperlinks pointing from each web page to other pages form a virtual directed graph.
• Hyperlinks are added and deleted at will by individual web page authors:
  – Web pages may not know their incoming hyperlinks.
  – The digraph is dynamic.
  – Central control of the hyperlinks is not possible.

The digraph is dynamic
• Search engines can only map a fraction of the whole web space.
• Even if we can manage the size of the digraph, its dynamic nature requires constant updates of the map.
• There are web pages that are not documented by any search engine.

The structure of the digraph
• Nodes: web pages (URLs).
• Directed edges: hyperlinks from one web page to another.
• Content of a node: the content of its associated web page.
• Dynamic nature of the digraph:
  – For some nodes, there are outgoing edges we do not know yet:
    • nodes not yet processed;
    • new edges (hyperlinks) may have been added to some nodes.
  – For all nodes, there are some incoming edges we do not yet know.

Useful Functions of the Digraph
• Backlink(the_url):
  – find all the URLs that point to the_url.
• Shortest_path(url1, url2):
  – return the shortest path from url1 to url2.
• Maximal_clique(url):
  – return a maximal clique that contains url.

[Figure: an ordinary digraph H with 7 vertices (v0-v6) and 12 edges.]

[Figure: a partially unknown digraph H with 7 vertices and 12 edges, in which node v5 is not yet explored. We do not know the outgoing edges from v5, though we know it exists (by its URL).]

Map of the hyperlink space
• To construct it, one needs:
  – a spider to automatically collect URLs;
  – a graph class to store information for nodes (URLs) and links (hyperlinks).
• The whole digraph (URLs, hyperlinks) is huge:
  – 162,128,493 hosts (July 2002 data from http://www.isc.org/ds/WWW-200207/index.html );
  – one may need graph algorithms that use secondary memory.

Spiders
• Automatically retrieve web pages:
  – start with a URL;
  – retrieve the associated web page;
  – find all URLs on the web page;
  – recursively retrieve not-yet-searched URLs.
• Algorithmic issues:
  – How to choose the next URL?
  – Avoid overloading sub-networks.

[Figure: spider architecture. Multiple url_spider processes share a URL pool (adding new URLs, getting the next URL to visit), issue HTTP requests and receive HTTP responses from the web space, and store results through a database interface.]

The spider
• Does this automatically (without clicking on a link or typing a URL).
• It is an automated program that searches the web:
  – read a web page;
  – store/index the relevant information on the page;
  – follow all the links on the page (and repeat the above for each link).

[Figures: Internet growth charts.]

Partial Map
• A partial map may be enough for some purposes:
  – e.g., we are often interested in only a small portion of the whole Internet.
• A partial map may fit within the memory of an ordinary PC, and may therefore allow fast performance.
• However, we may not be able to collect all the information necessary for our purpose:
  – e.g., back links to a URL.

Back link
• Hyperlinks on the web are forward links.
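The pieces above, a spider, a graph class, and the backlink and shortest-path functions, can be tied together in a short sketch: since hyperlinks are forward links, back links can only be discovered by crawling pages and inverting the recorded digraph. This is a minimal illustration, not the lecture's code; the "web" is a plain dict standing in for real HTTP fetches, and all URLs are hypothetical.

```python
from collections import deque

def crawl(seed, fetch_links):
    """Breadth-first spider: visit each URL once, recording its out-links."""
    graph = {}                      # url -> list of out-links (the digraph)
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in graph:
            continue                # already processed
        links = fetch_links(url)    # stands in for an HTTP request + parsing
        graph[url] = links
        frontier.extend(links)
    return graph

def backlinks(graph, the_url):
    """Backlink(the_url): all crawled URLs that point to the_url."""
    return [u for u, outs in graph.items() if the_url in outs]

def shortest_path(graph, url1, url2):
    """Shortest_path(url1, url2): a shortest hyperlink path, or None."""
    parent = {url1: None}
    frontier = deque([url1])
    while frontier:
        u = frontier.popleft()
        if u == url2:               # reconstruct the path back to url1
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph.get(u, []):
            if v not in parent:
                parent[v] = u
                frontier.append(v)
    return None

# A toy web space (hypothetical URLs); web.get plays the role of the fetcher.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
g = crawl("a", web.get)
```

Note that `backlinks` can only report links inside the crawled portion of the web, which is exactly the limitation noted on the Partial Map slide.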
• A web page may not know the hyperlinks that point to it:
  – authors of web pages can freely link to other documents in cyberspace without the consent of those documents' authors.
• Back links may be of value:
  – in the scientific literature, the SCI (Science Citation Index) is an important index for judging the academic value of an article;
  – www.isinet.com
• It is not easy to find all the back links.

Discovery of Back links
• Provided in the advanced search features of several search engines.
  – In Google, search by typing link:url as your keyword.
  – Example: 1. go to www.google.com; 2. type link:http://www.cs.cityu.edu.hk/~lwang
  – Homework: find out how to retrieve back links from other search engines.
  – http://decweb.ethz.ch/WWW7/1938/com1938.htm
• Surfing the Web Backwards
  – www8.org/w8-papers/5b-hypertext-media/surfing/surfing.html

Web structure mining
• Some information is embedded in the digraph:
  – usually, the hyperlinks from a web page to other web pages are chosen because, in the view of the page's author, they are important and contain useful related information;
  – e.g., fans of football may all have links pointing to their favorite football teams.
• Some basic technology tools for web structure mining:
  – a spider;
  – a graph class;
  – some relevant graph algorithms;
  – a back link retrieval function.

Intellectual Content of Hyperlink Structure
• Fans
• PageRank
• Densely connected subgraphs as communities

FAN
• Fans of a web page often put a link toward that web page.
• This is usually done manually, after a user has accessed the web page and looked at its content.

[Figure: fans of a web page, several fan pages each linking to the web page.]

FANs as an indicator of popularity
• The more fans a web page has, the more popular it is.
• The SCI (Science Citation Index), for example, is a well-established method of rating the importance of a research paper published in international journals.
  – It is somewhat controversial as a measure of importance, since some important work may not be popular.
  – But it is a good indicator of influence.

An objection to FANs as an indicator of importance
• Some of the most popular web pages are so well known that people may not bother to link to them.
• On the assumption that some web pages are more important than others, how do we compare:
  – a web page linked to by important web pages, and
  – another linked to by less important web pages?

PageRank
• Two factors influence the rank of a web page:
  – the rank of the web pages pointing to it (the higher the better);
  – the number of outgoing links on the web pages pointing to it (the fewer the better).

Definition
• Web pages are ranked according to their PageRank, calculated as follows:
  – Assume page A has pages T1...Tn pointing to it (i.e., back links or citations).
  – Choose a damping factor d between 0 and 1 (usually d = 0.85).
  – Let C(A) be the number of links going out of page A.
  – The PageRank of page A is then:
    • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Calculation of PageRank
• Notice that the definition of PR(A) is cyclic:
  – i.e., ranks of web pages are used to calculate ranks of web pages.
• However, PageRank can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
• It is reported that PageRank for 26 million web pages can be computed in a few hours on a medium-sized workstation.

An example

[Figure: a three-page graph with links a → b, a → c, b → c, and c → a.]

PageRank of the example graph
• Start with PR(a) = 1, PR(b) = 1, PR(c) = 1.
  – Apply PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).
• For simplicity, set d = 1, and recall that C() is the out-degree.
• After the first iteration, we have:
  – PR(a) = 1, PR(b) = 1/2, PR(c) = 3/2.
• After the second iteration, we have:
  – PR(a) = 3/2, PR(b) = 1/2, PR(c) = 1.
• Subsequent iterations:
  – a: 1, b: 3/4, c: 5/4
  – a: 5/4, b: 1/2, c: 5/4
• In the limit:
  – PR(a) = 6/5, PR(b) = 3/5, PR(c) = 6/5.

An example (first update)
• C(a) = 2, C(b) = 1, C(c) = 1; PR(a) = PR(b) = PR(c) = 1.
• Update:
  – PR(a) = PR(c)/C(c) = 1
  – PR(b) = PR(a)/C(a) = 1/2
  – PR(c) = PR(a)/C(a) + PR(b)/C(b) = 1/2 + 1 = 3/2

An example (second update)
• PR(a) = 1, PR(b) = 1/2, PR(c) = 3/2.
• Update:
  – PR(a) = PR(c)/C(c) = 3/2
  – PR(b) = PR(a)/C(a) = 1/2
  – PR(c) = PR(a)/C(a) + PR(b)/C(b) = 1/2 + 1/2 = 1

An example (third update)
• PR(a) = 3/2, PR(b) = 1/2, PR(c) = 1.
• Update:
  – PR(a) = PR(c)/C(c) = 1
  – PR(b) = PR(a)/C(a) = 3/4
  – PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/4 + 1/2 = 5/4

Bringing Order to the Web
• Used maps containing as many as 518 million of these hyperlinks.
• These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance.
• For most popular subjects, a simple text-matching search restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu).
• For the full-text searches in the main Google system, PageRank also helps a great deal.
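The worked PageRank example above can be reproduced in a few lines. This is a sketch under the slides' assumptions (d = 1, synchronous updates starting from PR = 1 on the three-page graph a → b, a → c, b → c, c → a), not Google's implementation.

```python
# The example digraph from the slides: a -> b, a -> c, b -> c, c -> a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pagerank(links, d=1.0, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over A's in-neighbours T."""
    pr = {p: 1.0 for p in links}                    # start with PR = 1
    out = {p: len(ts) for p, ts in links.items()}   # C(): out-degree
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[q] / out[q]
                                   for q in links if p in links[q])
              for p in links}                       # synchronous update
    return pr

pr = pagerank(links)
# approaches PR(a) = 6/5, PR(b) = 3/5, PR(c) = 6/5, the limit on the slide
```

With d = 0.85 the same loop computes the damped PageRank of the definition slide.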
• As reported in “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, by Sergey Brin and Lawrence Page.

Do densely connected subgraphs represent web sub-communities?
• Inferring Web Communities from Link Topology
  – http://citeseer.nj.nec.com/36254.html
• Efficient Identification of Web Communities
  – http://citeseer.nj.nec.com/flake00efficient.html
• Friends and Neighbors on the Web
  – http://citeseer.nj.nec.com/adamic01friends.html

An idea: complete subgraphs
• There is a group of URLs such that:
  – each URL has a link to every other URL in the group.
• This is evidence that the author of each web page is interested in every other web page in the subgroup.

Another idea: complete bipartite subgraphs
• Complete bipartite graph:
  – two groups of nodes, U and V;
  – for each node u in U and each node v in V, there is an edge from u to v.
• References:
  – D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998.
  – T. Murata. Finding related web pages based on connectivity information from a search engine. In Proc. 10th WWW Conference.

Problem description
• Suppose one is familiar with some web pages on a specific topic, such as sports.
• Problem: find more pages about the same topic.
• Web community: an entity of related web pages (the centers).

Search for fans using a search engine
• Use the input URLs as initial centers.
• Search for URLs referring to all the centers, by backlink search from the centers.
• A fixed number of high-ranking URLs are selected as fans.

Adding a new URL to the centers
• Acquire the fans' HTML files over the Internet.
• Extract the hyperlinks in the HTML files.
• Sort the hyperlinks in order of frequency.
• Add the top-ranking hyperlink to the centers.
• Delete fans that do not refer to all the centers.

Web community
• Repeat the previous steps until few fans are left.
• The acquired centers are regarded as a WEB COMMUNITY.

[Figure: a web community, with fans on the left linking to centers on the right; many web pages point to the centers.]

Drawbacks
• A maximum clique is difficult to find (an NP-hard problem).
• The rough idea of closely linked URLs is right, but requiring a completely connected subgraph may not be: it is often the case that some links are missing.
  – Fans may not have hyperlinks to centers that were created after their own web pages.

Minimum Cut Paradigm
• A minimum cut of a digraph (V, A) is a partition of the node set V into two subsets U and W such that the number of edges from U to W is minimized.
  – It captures the notion that U and W are NOT closely linked.
  – Therefore, nodes in U are more closely related to each other than to nodes in W.

General approach
• Find a minimum cut using a maximum flow algorithm.
• If the minimum cut is sufficiently large, keep the graph and report its nodes as a web community.
• Else:
  – remove the edges of the minimum cut to split the digraph into two connected components;
  – repeat on each of the two connected components.

[Figures: a ten-node digraph (a-j) split repeatedly by removing minimum-cut edges.]

Efficient Identification of Web Communities
• A heuristic implementation of the minimum cut paradigm for web communities.
  – Gary William Flake, Steve Lawrence, and C. Lee Giles.
  – Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD 2000), pp. 150-160, August 2000, Boston, USA.

Problem description
• Given some web pages,
• Problem: find a community of related pages.
• Community: a set of web pages that link (in either direction) to more web pages in the community than to pages outside the community.

Methodology: maximum flow and minimum cuts
• Maximum flow: given a digraph G = (V, E), a capacity function c(u, v), a source s ∈ V, and a sink t ∈ V,
• Problem: find the maximum flow from s to t, obeying all capacity constraints.

[Figure: a small capacitated network with a flow from s to t.]

• Cut: a set of edges whose removal separates s and t.
• Minimum cut: a cut of minimum weight.

[Figure: a cut of weight 6 separating s and t.]

• Max-flow min-cut theorem:
  – the value of the maximum flow of the network equals the weight of the minimum cut separating s and t.

Community to minimum cut
• Theorem: a community C can be identified by calculating the s-t minimum cut of G, with s and t used as the source and sink, provided both s# and t# exceed the cut-set size. After the cut, the vertices that are reachable from s are in the community.
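The theorem above can be exercised on a toy graph. The sketch below is my own illustration, not Flake et al.'s implementation: it computes an s-t maximum flow with the Edmonds-Karp algorithm, and when no augmenting path remains, the vertices still reachable from s in the residual graph form the s-side of the minimum cut, i.e. the community.

```python
from collections import deque

def max_flow_community(capacity, s, t):
    """Edmonds-Karp max flow; returns (flow value, vertices on s's side)."""
    # Residual capacities, with reverse edges initialised to 0.
    res = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u in capacity:
        for v in capacity[u]:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for an augmenting path with spare residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break                 # no augmenting path left: flow is maximum
        # Find the bottleneck along the path, then push flow along it.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push
    # The last (failed) BFS reached exactly the residual-reachable vertices;
    # by the max-flow min-cut theorem, these are the s-side of a minimum cut.
    return flow, set(parent)

# Toy capacitated digraph (hypothetical): the minimum cut is {a->t, b->t}.
cap = {"s": {"a": 3, "b": 3}, "a": {"b": 1, "t": 1}, "b": {"t": 1}, "t": {}}
flow, community = max_flow_community(cap, "s", "t")
# flow == 2; community == {"s", "a", "b"}
```

Edmonds-Karp runs in O(V·E²), one of the polynomial-time algorithms the slides mention for computing the minimum cut via the maximum flow.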
• s#: the number of edges between s and C - s.
• t#: the number of edges between t and V - C - s.
• Note: compute the minimum cut by computing the maximum flow (polynomial-time algorithms exist).

Algorithm details: the initial graph
• (a): a virtual source s, linking to all seeds.
• (b): the k input seed web pages; find all pages (c) that link to them (backlinks, from AltaVista) or are linked from them (extracted from the HTML files).
• Download their HTML files and all outbound links (d).
• (e): a virtual sink t, linked to from (d).
• Along each hyperlink there is an edge.
• Edges between vertices in (b) and (c) are bidirectional; all others are one-way.
• The capacity of edges from (a) to (b) and from (d) to (e) is sufficiently large; all others have capacity one.

[Figure: the initial graph. (a): virtual source vertex; (b): seed web sites; (c): web sites linking to or from the seeds; (d): references to sites in neither (b) nor (c); (e): virtual sink.]

Algorithm details: the procedure
• Procedure focused-crawl(graph G = (V, E); vertices s, t ∈ V):
  – While the number of iterations is less than desired, do:
    • Perform a maximum flow analysis of G, yielding community C.
    • Identify the non-seed vertex v* ∈ C with the highest in-degree relative to G.
    • For all v ∈ C with in-degree equal to v*'s:
      – add v to the seed set;
      – add edge (s, v) to E with infinite capacity.
    • Identify the non-seed vertex u* ∈ C with the highest out-degree relative to G.
    • For all u ∈ C with out-degree equal to u*'s:
      – add u to the seed set;
      – add edge (s, u) to E with infinite capacity.
    • Re-crawl so that G uses all seeds.
    • Let G reflect the new information from the crawl.
  – End while.
• End procedure.

In-degree of a node in a digraph
• The in-degree of a node is the number of links that point to the node.
• The in-degree of a URL represents, to some extent, its popularity.
  – The more a web page is pointed to by other web pages, the more web authors are interested in it.

Internet Physical Structures
• The hyperlink structure is a social one, in the sense that the links are chosen by web page authors according to their own preferences.
• The physical structure also exhibits interesting properties that are useful for efficient routing algorithms.
• Interested readers may refer to:
  – H. Burch and B. Cheswick. Mapping the Internet. IEEE Computer, April 1999.
  – P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, and L. Zhang. IDMaps: A Global Internet Host Distance Estimation Service. IEEE/ACM Transactions on Networking, October 2001.
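The in-degree measure discussed above is straightforward to compute once the hyperlink digraph is available as an edge list. A tiny sketch with hypothetical URLs:

```python
from collections import Counter

# Hyperlinks as (source, target) pairs; the multiplicity of each target is
# its in-degree, a rough proxy for the page's popularity.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
in_degree = Counter(target for _, target in edges)
# in_degree["c"] == 3: page c is pointed to by three pages, the most "popular"
```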