Lecture 6: (Week 6) Web (Social) Structure Mining

The ppt is taken from Xiaotie Deng
• Hypertext Transfer Protocol
• Hyperlink structures of the web
• Relevance of closely linked web pages
• In-degree as a measure of popularity
• Physical Structure of Internet
• Interesting websites:
  1. http://prj61/GoogleTest3/GoogleSearch.aspx
  2. http://www.google.com.hk/
2016/5/29
Dr. Wang
1
Hyperlinks
• The Hypertext Transfer Protocol (HTTP/1.0)
– An application-level protocol
• with the lightness and speed necessary for distributed, collaborative,
hypermedia information systems.
– A generic, stateless, object-oriented protocol
• which can be used for many tasks such as name servers and distributed
object management systems, through extension of its request methods
(commands).
– An important feature: the typing of data representation allows systems
to be built independently of the data being transferred
– Based on request/response paradigm
– In use by the WWW global information initiative since 1990.
• http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1945.txt
HTML (Hypertext Markup Language)
• HTML - the Hypertext Markup Language - is the lingua
franca for publishing on the World Wide Web. Having
gone through several stages of evolution, today's
HTML has a wide range of features reflecting the
needs of a very diverse and international community
wishing to make information available on the Web.
– see: http://www.w3.org/MarkUp/
Web Structure
• Directed Virtual Links
• Special Features: Huge, Unknown, Dynamic
• Directed Graph Functions: backlink, shortest
path, cliques
An Unknown Digraph
• The hyperlinks pointing from each web page to other web pages form a virtual directed graph
• Hyperlinks are added and deleted at will by individual web page authors
  – Web pages may not know their incoming hyperlinks
  – The digraph is dynamic
  – Central control of the hyperlinks is not possible
The digraph is dynamic
• Search engines can only map a fraction of the whole
web space
• Even if we can manage the size of the digraph, its
dynamic nature requires constant update of the map
• There are web pages that are not documented by any
search engine.
The structure of the Digraph
• Nodes: web pages (URLs)
• Directed edges: a hyperlink from one web page to another
• Content of a node: the content of its associated web page
• Dynamic nature of the digraph:
  – For some nodes, there are outgoing edges that we do not know yet
    • Nodes that have not yet been processed
    • New edges (hyperlinks) may have been added to some nodes
  – For all nodes, there are some incoming edges that we do not yet know
Useful Functions of the
Digraph
• Backlink (the_url):
– find out all the urls that point to the_url
• Shortest_path (url1, url2):
– return the shortest path from url1 to url2
• Maximal clique (url):
– return a maximal clique that contains url
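The first two functions can be sketched over a plain in-memory adjacency map. Below is a minimal Python illustration (the maximal-clique function is omitted, since maximum clique is NP-hard in general); the toy `web` graph and the function names are invented for illustration:

```python
from collections import deque

def backlinks(graph, the_url):
    """Return all URLs that point to the_url (graph: url -> set of out-links)."""
    return {u for u, outs in graph.items() if the_url in outs}

def shortest_path(graph, url1, url2):
    """Breadth-first search for a shortest hyperlink path from url1 to url2."""
    parent = {url1: None}
    queue = deque([url1])
    while queue:
        u = queue.popleft()
        if u == url2:
            # Reconstruct the path by walking parents back to url1
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None  # url2 not reachable from url1

web = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(backlinks(web, "c"))           # {'a', 'b'}
print(shortest_path(web, "b", "a"))  # ['b', 'c', 'a']
```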
[Figure: an ordinary digraph H with 7 vertices (v0–v6) and 12 edges.]
[Figure: a partially unknown digraph H with 7 vertices and 12 weighted edges. Node v5 is not yet explored: we do not know the outgoing edges from v5, although we know of its existence (by its URL).]
Map of the hyperlink space
• To construct it, one needs
– a spider to automatically collect URLs
– a graph class to store information for nodes(URLs) and links
(hyperlinks)
• The whole digraph (URLs, HYPERLINKs) is huge:
  – 162,128,493 hosts (July 2002 data from http://www.isc.org/ds/WWW-200207/index.html )
  – One may need graph algorithms that use secondary memory (external-memory algorithms)
Spiders
• Automatically retrieve web pages
  – Start with a URL
  – Retrieve the associated web page
  – Find all URLs on the web page
  – Recursively retrieve not-yet-searched URLs
• Algorithmic issues
  – How to choose the next URL?
  – How to avoid overloading sub-networks
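The retrieval loop above can be sketched as a breadth-first crawl. This is a toy illustration, not a production spider: the hypothetical `get_links` callback stands in for the HTTP fetch plus link extraction, and the simulated `web` dict replaces real pages:

```python
from collections import deque

def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl: repeatedly take an unvisited URL, 'retrieve' its
    page via get_links (a stand-in for HTTP fetch + link extraction),
    and enqueue the URLs found there."""
    visited = set()
    frontier = deque([start_url])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)
    return visited

# Simulated web space instead of real HTTP requests:
web = {"home": ["a", "b"], "a": ["b", "c"], "b": [], "c": ["home"]}
print(crawl("home", lambda u: web.get(u, [])))
```

A real spider would add politeness delays per host, which is the "avoid overloading sub-networks" issue above.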
Spider Architecture
[Diagram: a shared URL pool feeds a set of url_spider processes (the spiders). Each spider gets a URL from the pool, sends an HTTP request into the web space, receives the HTTP response, adds newly found URLs back to the pool, and stores results through a database interface into the database.]
The spider
• Does this automatically (without clicking on a link or typing a URL)
• It is an automated program that searches the web:
  – Read a web page
  – Store/index the relevant information on the page
  – Follow all the links on the page (and repeat the above for each link)
Internet Growth Charts
[Two slides of growth charts, not reproduced in this text version.]
Partial Map
• A partial map may be enough for some purposes
  – e.g., we are often interested in a small portion of the whole Internet
• A partial map may be constructed within the memory of an ordinary PC, and may therefore allow fast performance
• However, we may not be able to collect all the information that is necessary for our purpose
  – e.g., back links to a URL
Back link
• Hyperlinks on the web are forward links.
• One web page may not know the hyperlinks that point to it
  – authors of web pages can freely link to other documents in cyberspace without consent from those documents' authors
• Back links may be of value
  – in scientific literature, the SCI (Science Citation Index) is an important index for judging the academic value of an article
    • www.isinet.com
• It is not easy to find all the back links
Discovery of Back links
• Provided in Advanced Search Features of Several
Search Engines
– Search from Google is done by typing
  • link:url
– as your keyword
– Example: 1. go to www.google.com and
  2. type link:http://www.cs.cityu.edu.hk/~lwang
– Homework: find out how to retrieve back links from other search engines
– http://decweb.ethz.ch/WWW7/1938/com1938.htm
• Surfing the Web Backwards
– www8.org/w8-papers/5b-hypertext-media/surfing/surfing.html
Web structure mining
• Some information is embedded in the digraph
– Usually, the hyperlinks from a web page to other web pages are
chosen because they are important and contain useful related
information in the view of the web page author
– e.g., fans of football may all have links pointing to their favorite football teams
• Some basic technology tools for web structure mining:
  – a spider
  – a graph class
  – some relevant graph algorithms
  – a back link retrieval function
Intellectual Content of Hyperlink Structure
• Fans
• Page Rank
• Densely Connected Subgraphs as Communities
FAN:
• Fans of a web page often put a link toward the
web page.
• It is usually done manually, after a user has accessed the web page and looked at its content.
[Figure: fans of a web page — many fan pages, each linking to the same web page.]
FANs as an indicator of popularity:
• The more fans a web page has, the more popular it is.
• The SCI (Science Citation Index), for example, is a well-established method to rate the importance of a research paper published in international journals.
  – It is somewhat controversial as a measure of importance, since some important work may not be popular
  – But it is a good indicator of influence
An Objection to FANs as an indicator of
importance:
• Some of the most popular web pages are so well known that people may not put links to them in their web pages
• On the assumption that some web pages are more important than others, how do we compare
  – a web page linked to by important web pages with
  – another linked to by less important web pages?
PageRank
• Two factors influence the rank of a web page:
  – The rank of the web pages pointing to it
    • The higher the better
  – The number of outgoing links on those pages
    • The fewer the better
Definition
• Web pages are ranked according to their PageRank, calculated as follows:
  – Assume page A has pages T1...Tn which point to it (i.e., back links or citations).
  – Choose a damping factor d between 0 and 1 (usually d = 0.85).
  – Let C(A) be the number of links going out of page A. The PageRank of page A is then:
    • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Calculation of PageRank
• Notice that the definition of PR(A) is cyclic.
– I.e., ranks of web pages are used to calculate the ranks of
web pages,
• However, PageRank or PR(A) can be calculated using
a simple iterative algorithm, and corresponds to the
principal eigenvector of the normalized link matrix of
the web.
• It is reported that a PageRank for 26 million web
pages can be computed in a few hours on a medium
size workstation.
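The iterative algorithm can be sketched in a few lines of Python, applying the slide's formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) until the ranks settle. The function and graph below are illustrative, not Google's original implementation:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) over in-links.
    links: page -> list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}  # start every page at rank 1
    for _ in range(iterations):
        new = {}
        for p in pages:
            # All pages q that link to p contribute PR(q) / C(q)
            incoming = (q for q in pages if p in links.get(q, ()))
            new[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
        pr = new
    return pr

# The three-page example from the slides: a -> b, a -> c, b -> c, c -> a
pr = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}, d=1.0)
print({p: round(v, 3) for p, v in sorted(pr.items())})
# with d = 1 this converges to the slides' limit: PR(a)=6/5, PR(b)=3/5, PR(c)=6/5
```

Scanning all pages for in-links each round is quadratic; a real implementation would precompute a reverse-link index or use the matrix form.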
An example
[Figure: three pages a, b, c with links a→b, a→c, b→c, and c→a.]
PageRank of example graph
• Start with PR(a)=1, PR(b)=1, PR(c) =1
– Apply PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
• For simplicity, set d=1, and recall that C(·) is the out-degree
• After first iteration, we have
– PR(a)=1, PR(b)=1/2, PR( c) =3/2
• For the second iteration, we have
– PR(a)=3/2, PR(b)=1/2, PR( c)=1
• Subsequent iterations:
– a:1 b:3/4 c:5/4
– a:5/4 b:1/2 c:5/4
• In the limit:
  – PR(a)=6/5, PR(b)=3/5, PR(c)=6/5
An example
b: C(b)=1, PR(b)=1
a: C(a)=2, PR(a)=1
c: C(c)=1, PR(c)=1
UPDATE:
PR(a) = PR(c)/C(c) = 1
PR(b) = PR(a)/C(a) = 1/2
PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/2
An example
b: C(b)=1, PR(b)=1/2
a: C(a)=2, PR(a)=1
c: C(c)=1, PR(c)=3/2
UPDATE:
PR(a) = PR(c)/C(c) = 3/2
PR(b) = PR(a)/C(a) = 1/2
PR(c) = PR(a)/C(a) + PR(b)/C(b) = 1/2 + 1/2 = 1
An example
b: C(b)=1, PR(b)=1/2
a: C(a)=2, PR(a)=3/2
c: C(c)=1, PR(c)=1
UPDATE:
PR(a) = PR(c)/C(c) = 1
PR(b) = PR(a)/C(a) = 3/4
PR(c) = PR(a)/C(a) + PR(b)/C(b) = 3/4 + 1/2 = 5/4
Bringing Order to the Web
• Used maps containing as many as 518 million of these hyperlinks.
• These maps allow rapid calculation of a web page's "PageRank",
an objective measure of its citation importance that corresponds
well with people's subjective idea of importance.
• For most popular subjects, a simple text matching search that is
restricted to web page titles performs admirably when PageRank
prioritizes the results (demo available at google.stanford.edu).
• For the type of full text searches in the main Google system,
PageRank also helps a great deal.
• As reported in “The Anatomy of a Large-Scale Hypertextual Web
Search Engine”, by Sergey Brin and Lawrence Page
Do Densely Connected Sub-graphs
represent Web Sub-communities?
• Inferring Web Communities from Link Topology
– http://citeseer.nj.nec.com/36254.html
• Efficient Identification of Web Communities
– http://citeseer.nj.nec.com/flake00efficient.html
• Friends and Neighbors on the Web
– http://citeseer.nj.nec.com/adamic01friends.html
An idea: Complete sub-graphs
• there is a group of URLs such that
– each URL has a link to every other URL in
the group
• This is evidence that the author of each web page in the group is interested in every other web page in the group
Another idea: Complete bipartite sub-graphs
• Complete Bipartite graph:
– two groups of nodes, U and V
– for each node u in U and each node v in V
• there is an edge from u to v
• References
  – D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998.
  – T. Murata. Finding related web pages based on connectivity information from a search engine. The 10th WWW Conference.
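Checking whether two groups of pages form a complete bipartite subgraph is straightforward over an adjacency map. The sketch below, with an invented `fans` graph, tests whether every node in U links to every node in V:

```python
def is_complete_bipartite(graph, U, V):
    """Check that every node in U links to every node in V
    (graph: url -> set of out-links)."""
    return all(v in graph.get(u, set()) for u in U for v in V)

# Toy example: f1 and f2 both link to both hubs, but f3 misses hub2
fans = {"f1": {"hub1", "hub2"}, "f2": {"hub1", "hub2"}, "f3": {"hub1"}}
print(is_complete_bipartite(fans, {"f1", "f2"}, {"hub1", "hub2"}))  # True
print(is_complete_bipartite(fans, {"f1", "f3"}, {"hub1", "hub2"}))  # False
```

Finding the largest such (U, V) pair is the hard part; the check itself is cheap.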
Problem Description
• Suppose one is familiar with some web pages on a specific topic, such as sports
• Problem: find more pages about the same topic
• Web community: an entity of related web pages (the centers)
Search of fans using a search
engine
• Use input URLs as initial centers
• Search URLs referring to all the centers
by backlink search from the centers
• A fixed number of high-ranking URLs are selected as fans
Adding a new URL to centers
• Acquire the fans' HTML files through the internet
• Extract the hyperlinks in the HTML files
• Sort the hyperlinks in order of frequency
• Add the top-ranking hyperlink to the centers
• Delete fans not referring to all the centers
Web Community
• Repeat the previous steps until few fans are left
• The acquired centers are regarded as a WEB COMMUNITY
Web community
[Figure: fan pages link to the center pages; the centers form the web community — many web pages link there.]
Drawbacks
• Maximum clique is difficult to find (NP-hard problem)
• The rough idea of closely linked URLs is right, but requiring a completely connected subgraph may be too strict. It is often the case that some links are missing
  – fans may not have hyperlinks to centers that were created after their web pages were created
Minimum Cut Paradigm
• Maximum clique is difficult to find (an NP-hard problem)
• The rough idea of closely linked URLs is right, but requiring a completely connected subgraph may be too strict. It is often the case that some links are missing
  – fans may not have hyperlinks to centers that were created after their web pages were created
• A minimum cut of a digraph (V, A) is a partition of the node set V into two subsets U and W such that the number of edges from U to W is minimized.
  – It captures the notion that U and W are NOT closely linked.
  – Therefore, nodes in U are more closely related to each other than to nodes in W.
General approach
• Find a min-cut using a maximum flow algorithm
• If the minimum cut is sufficiently large, keep it and report the nodes as a web community
• Else:
  – remove the edges in the minimum cut to split the digraph into two connected components
  – repeat on each of the two connected components
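The first step can be sketched in Python as an Edmonds-Karp max-flow over a capacity map, returning the flow value and the source side of the minimum cut. All names and the toy graph are illustrative, not taken from the cited papers:

```python
from collections import deque, defaultdict

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp max-flow; returns (flow value, set of nodes on the
    source side of the minimum s-t cut). capacity: dict (u, v) -> capacity."""
    residual = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # reverse edges live in the residual graph
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        # Find the bottleneck capacity and push flow along the path
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck
    # Source side of the min cut = nodes reachable in the final residual graph
    side, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in side and residual[(u, v)] > 0:
                side.add(v)
                queue.append(v)
    return flow, side

caps = {("s", "a"): 3, ("a", "b"): 1, ("s", "b"): 1, ("a", "t"): 1, ("b", "t"): 3}
print(max_flow_min_cut(caps, "s", "t"))  # flow 3; source side {'s', 'a'}
```

By the max-flow min-cut theorem (stated later in these slides), the returned flow value equals the capacity of the cut separating `side` from the rest.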
[Figure: an example digraph on nodes a, b, c, d, e, f, g, h, i, j.]
[Figure: the digraph after one split — nodes a, b, c, d, e, f, g, h, j (node i removed).]
[Figure: the digraph after a further split — nodes a, b, c, d, e, f, g, j (node h removed).]
Efficient Identification of
Web Community
• A heuristic implementation of the minimum cut paradigm for web communities
  – Gary William Flake, Steve Lawrence, and C. Lee Giles
  – Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD 2000), pp. 150-160, August 2000, Boston, USA
Problem Description
• Given some web pages,
• Problem: find a community of related pages.
• Community: a set of web pages that link (in either direction) to more web pages in the community than to pages outside the community
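This definition can be checked directly on a link graph. The sketch below (invented names and toy graph) tests whether each page in a candidate set has more neighbors, counting links in either direction, inside the set than outside:

```python
def is_community(graph, community):
    """Check the slide's definition: every page in `community` links (in either
    direction) to more pages inside the community than outside it.
    graph: url -> list of out-links; community: set of urls."""
    for page in community:
        # Neighbors via outgoing links plus incoming links
        neighbors = set(graph.get(page, ()))
        neighbors |= {q for q, outs in graph.items() if page in outs}
        inside = len((neighbors & community) - {page})
        outside = len(neighbors - community)
        if inside <= outside:
            return False
    return True

web = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": ["a"]}
print(is_community(web, {"a", "b", "c"}))  # True
print(is_community(web, {"a", "d"}))       # False
```

Verifying a candidate set is easy; the contribution of Flake et al. is finding such a set efficiently via max-flow.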
Methodology: Maximum Flow
and Minimal Cuts
• Maximum Flow: digraph G = (V, E), capacity function c(u, v), source s ∈ V, sink t ∈ V
• Problem: find the maximum flow from s to t, obeying all capacity constraints.
[Figure: a small network with capacitated edges and a flow from s to t.]
Methodology: Maximum Flow
and Minimal Cuts
• Cut: a set of edges whose removal separates s and t
• Minimum cut: a cut with minimal total weight
[Figure: an example cut of weight 6 separating s from t.]
Methodology: Maximum Flow
and Minimal Cut
• Max-flow min-cut theorem:
  • the value of the maximum flow of the network is identical to the capacity of the minimum cut that separates s and t.
Community to minimum cut
• Theorem: A community, C, can be identified by
calculating the s-t minimum cut of G with s and t being
used as the source and sink, provided both s# and t#
exceed the cut set size. After the cut, vertices that are
reachable from s are in the community.
• s#: number of edges between s and (C - s)
• t#: number of edges between t and (V - C - s)
• Note: compute the minimum cut by computing the maximum flow (polynomial-time algorithms exist)
Algorithm details: the initial
graph
• (a): a virtual source s, linking to all seed pages
• (b): the k input seed web pages; find all pages (c) that link to the seed set (backlinks from AltaVista) or are linked from it (extracted from the HTML files)
• Download their HTML files and all outbound links (d), linking to
• (e): a virtual sink t
• Along each link there is an edge
• Edges between vertices in (b) and (c) are bidirectional; others are one-way
• The capacity of edges from (a) to (b) and from (d) to (e) is sufficiently large; others have capacity one
Initial Graph
[Figure: the initial graph, layered (a) → (b) → (c) → (d) → (e).]
(a): Virtual source vertex; (b): seed web sites; (c):web sites link to
or from seeds; (d): references to sites not in (b) nor in (c); (e): virtual sink
Algorithm details:
Procedure focused-crawl(graph G = (V, E); vertices s, t ∈ V)
• While # of iterations is less than desired do:
  • Perform maximum flow analysis of G, yielding community C.
  • Identify the non-seed vertex v* ∈ C with the highest in-degree relative to G.
  • For all v ∈ C with in-degree equal to v*'s:
    • Add v to the seed set
    • Add edge (s, v) to E with infinite capacity
  • End for
  • Identify the non-seed vertex u* ∈ C with the highest out-degree relative to G.
  • For all u ∈ C with out-degree equal to u*'s:
    • Add u to the seed set
    • Add edge (s, u) to E with infinite capacity
  • End for
  • Re-crawl so that G uses all seeds
  • Let G reflect new information from the crawl
• End while
End procedure
In-degree of a node in digraph
• The in-degree of a node is the number of links that point to the node
• The in-degree of a URL represents, to some extent, its popularity
  – The more web pages point to a page, the more web authors are interested in it
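Counting in-degrees over a hyperlink map is a one-liner with `collections.Counter`; the toy graph below is invented for illustration:

```python
from collections import Counter

def in_degrees(graph):
    """Count incoming links for every node (graph: url -> list of out-links)."""
    counts = Counter()
    for outs in graph.values():
        counts.update(outs)  # each out-link is one incoming link for its target
    return counts

web = {"a": ["c"], "b": ["c"], "c": ["a"]}
print(in_degrees(web).most_common(1))  # [('c', 2)] — c is the most popular page
```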
Internet Physical Structures
• The hyperlink structure is a social one in the sense
that the links are chosen by web page authors out of
their own preferences
• The physical structure also exhibits interesting
properties that are useful for efficient routing
algorithms.
• Interested readers may refer to:
– H. Burch and B. Cheswick. Mapping the Internet. In IEEE Computer,
April 1999.
– P. Francis, S. Jamin, C. Jin, Y. Jin, D. Raz, Y. Shavitt, and L. Zhang. IDMaps: A Global Internet Host Distance Estimation Service. IEEE/ACM Transactions on Networking, October 2001.