Presentation at Telecom ParisTech
XML data management and approximate string matching
Jiaheng Lu
Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China
November 22, 2010

Research experience
- Associate Professor, Renmin University of China: XML data management, cloud data management, approximate search
- Post-doc, University of California, Irvine: data integration, approximate string matching
- PhD, National University of Singapore: XML data management

Outline
- XML data management
  - XML twig query processing
  - XML keyword search
  - Graphical and interactive XML query processing
- Approximate string matching
  - Approximate string search
  - Approximate member extraction

XML twig query processing
XPath: Section[Title]/Paragraph//Figure
Twig pattern: Section is the root, with children Title and Paragraph; Figure is a descendant of Paragraph.

XML twig query processing (Cont.)
Problem statement: given a query twig pattern Q and an XML database D, compute ALL the answers to Q in D.
E.g., consider the following query and document:
[Figure: a document tree with elements s1, s2, t1, t2, p1, f1, and a query twig with nodes section, title, figure]
Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)

Previous work: TwigStack
TwigStack [1] is a holistic algorithm for XML twig matching on the containment labeling scheme.
TwigStack has two steps:
(1) intermediate path solutions are output to match each root-to-leaf path of the query; and
(2) these intermediate path solutions are merged to get the final results.
[1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
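The two steps can be illustrated with a deliberately naive sketch. It assumes that a containment label is a (start, end, level) triple and that u is an ancestor of v when u's interval strictly contains v's; the element names, label values, and function names are illustrative assumptions. The nested loops below are not TwigStack itself (which uses stacks over sorted streams), only the "path solutions, then merge" idea for a query with the two paths s//t and s//f.

```python
from itertools import product

# Containment (region) labels: (start, end, level). Values are illustrative.
LABELS = {
    "s1": (1, 12, 1), "s2": (4, 11, 2),
    "t1": (2, 3, 2), "t2": (5, 6, 3),
    "f1": (8, 9, 4),
}
# One stream of elements per query node, in document order.
STREAMS = {"s": ["s1", "s2"], "t": ["t1", "t2"], "f": ["f1"]}


def is_ancestor(u, v):
    """u is an ancestor of v if u's interval strictly contains v's."""
    (us, ue, _), (vs, ve, _) = LABELS[u], LABELS[v]
    return us < vs and ve < ue


def path_solutions(anc_tag, desc_tag):
    """Step 1: all (ancestor, descendant) pairs for one root-to-leaf path."""
    return [(a, d)
            for a, d in product(STREAMS[anc_tag], STREAMS[desc_tag])
            if is_ancestor(a, d)]


def merge(paths_t, paths_f):
    """Step 2: join the path solutions on the shared root element."""
    return [(s, t, f) for s, t in paths_t for s2, f in paths_f if s == s2]


s_t = path_solutions("s", "t")   # solutions for s//t
s_f = path_solutions("s", "f")   # solutions for s//f
results = merge(s_t, s_f)        # (s, t, f) twig matches
```

With these labels the merge yields three matches, (s1, t1, f1), (s1, t2, f1), and (s2, t2, f1).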
Running example: TwigStack algorithm
Query: s with descendants t and f (paths s//t and s//f)
Data streams:
- s: (1,12,1), (4,11,2)
- t: (2,3,2), (5,6,3)
- f: (8,9,4)
Intermediate path solutions output:
- s//t: (1,12,1) (2,3,2); (1,12,1) (5,6,3); (4,11,2) (5,6,3)
- s//f: (1,12,1) (8,9,4); (4,11,2) (8,9,4)
Final results:
- (1,12,1) (2,3,2) (8,9,4)
- (1,12,1) (5,6,3) (8,9,4)
- (4,11,2) (5,6,3) (8,9,4)
[Figure: state of the stacks during the run]

Limitations of TwigStack
(1) TwigStack may output many useless intermediate results for queries with parent-child relationships.
(2) TwigStack cannot process XML twig queries with ordered predicates, such as "preceding" and "following" in XPath.
(3) TwigStack cannot answer queries with wildcards in branching nodes. E.g., for a branching node * with B and C below it, the parent of B should be an ancestor of C.

XML twig query processing (Cont.)
Several efficient pattern matching algorithms:
- TJFast (VLDB 05) (citation: 173)
- iTwigJoin (SIGMOD 05)
- TwigStackList (CIKM 04)
- TreeMatch (TKDE 10)

Motivation: new labeling scheme
TwigStackList and iTwigJoin are both based on the containment labeling scheme.
Why not try the Dewey labeling scheme for XML twig pattern queries?
Oh, it is really a novel idea!

Original Dewey labeling scheme
In the Dewey labeling scheme, each element is represented by an integer sequence:
(i) the root is labeled by the empty string ε;
(ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
For example: the root s1 has children t1, s2, and f2, labeled 1, 2, and 3; the children of s2, t2 and f1, are labeled 2.1 and 2.2.

Main problem of the original Dewey
If we use the original Dewey labeling scheme to answer a twig query, we need to read the labels for all query nodes. Thus, it is not a better solution than previous algorithms.
Idea: extend the original Dewey labeling scheme so that, given the label of any element e, we can know the path of e from this label alone.

Modular function
We need some schema information: a DTD (Document Type Definition) or an XML schema.
Given DTD information: book → author, title, chapter*
Our solution: using a modular function, we create a mapping between an element tag and an integer.
We define X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2, where X_t is the last integer of the label of an element with tag t. The modulus 3 is the number of distinct tags under book.
Example: the children of book are author (0), title (1), chapter (2), chapter (5).
Why is the second chapter not labeled 3, as in the original Dewey? Because 3 mod 3 = 0 would denote author; 5 is the next integer with remainder 2.

Derive element tag
From a label, we can derive its tag name.
book → author, title, chapter*
Recall that we define: X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2.
Under book: 0 → author, 1 → title, 2 → chapter, 5 → chapter.

More examples for assigning labels
Let us consider a more complicated DTD: a → (b | c)*, d?, c+
We define: X_b mod 3 = 0, X_c mod 3 = 1, X_d mod 3 = 2. (Why do we use mod 3 instead of mod 4?)
Example: the children of a are b (0), d (2), c (4), c (7).

Derive the path from a label
By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.
DTD:
- book → author, title, chapter*
- chapter → (paragraph | section)*
- section → (paragraph | section)*
FST transitions:
- from book: mod 3 = 0 → author; mod 3 = 1 → title; mod 3 = 2 → chapter
- from chapter: mod 2 = 0 → paragraph; mod 2 = 1 → section
- from section: mod 2 = 0 → paragraph; mod 2 = 1 → section
[Figure: FST diagram and a sample document tree with book, author, title, chapter, section, and paragraph elements]
Question: given the label 5.1.0, what is the corresponding path?
Following the transitions (5 mod 3 = 2, 1 mod 2 = 1, 0 mod 2 = 0), 5.1.0 denotes: book/chapter/section/paragraph

Two properties of extended Dewey
- Find ancestor labels: from the label of any element, we can derive the labels of all its ancestors.
- Find ancestor names: from the label of any element, we can derive the tag names of all its ancestors.
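The label assignment and the FST walk can be sketched as follows. The assignment rule used here (take the smallest integer larger than the previous sibling's component that has the required remainder) is an assumption inferred from the slide examples 0, 1, 2, 5 and 0, 2, 4, 7; the function names and the list-of-integers label representation are hypothetical.

```python
# Child tags per parent tag, in DTD order; a tag's index is its remainder.
CHILD_TAGS = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def assign_components(parent_tag, child_tags, schema=CHILD_TAGS):
    """Assign the last label component to each child, left to right.

    Assumed rule: pick the smallest integer larger than the previous
    sibling's component whose remainder, modulo the number of distinct
    child tags, equals the index of the child's tag.
    """
    tags = schema[parent_tag]
    n = len(tags)
    components, prev = [], -1
    for tag in child_tags:
        k = tags.index(tag)
        x = prev + 1
        while x % n != k:
            x += 1
        components.append(x)
        prev = x
    return components


def decode_path(label, root_tag="book", schema=CHILD_TAGS):
    """Walk the FST: each component selects a child tag by its remainder."""
    path = [root_tag]
    for x in label:
        tags = schema[path[-1]]
        path.append(tags[x % len(tags)])
    return "/".join(path)


def ancestor_labels(label):
    """Every proper prefix of a label is the label of an ancestor."""
    return [label[:i] for i in range(len(label))]


book_children = assign_components(
    "book", ["author", "title", "chapter", "chapter"])
a_children = assign_components(
    "a", ["b", "d", "c", "c"], schema={"a": ["b", "c", "d"]})
path = decode_path([5, 1, 0])
```

Here `book_children` and `a_children` reproduce the sibling components shown on the slides, and `decode_path([5, 1, 0])` walks chapter, section, paragraph. The prefixes returned by `ancestor_labels` are what make both "find ancestor" properties cheap: no other label has to be read.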
These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.

A new algorithm: TJFast
- For each leaf node n in the query, there is a corresponding input stream Tn. Tn contains the extended Dewey labels of the elements with tag n, arranged in document order.
- For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that may contribute to query answers. (How does this differ from TwigStackList?)
- At any point of the computation, the size of the set Sb is bounded by the depth of the XML document.

An example for TJFast algorithm
Document (element and extended Dewey label):
a1 (0)
  a2 (0.0)
    d1 (0.0.1)
  a3 (0.3)
    d2 (0.3.1)
    b1 (0.3.2)
      c1 (0.3.2.1)
  b2 (0.5)
    d3 (0.5.0)
      c2 (0.5.0.0)
Query: a twig with branching node A and the two paths A//D and A/B//C; { … } denotes the set for the branching node A.
DTD: a -> a*, d*, b*; b -> d*, c*; d -> c*
Streams:
- TD: 0.0.1, 0.3.1, 0.5.0
- TC: 0.3.2.1, 0.5.0.0
Why are there only two streams?

An example for TJFast algorithm (Cont.)
Using the finite state transducer of the extended Dewey labeling scheme, we derive the paths from the first label of each stream:
- 0.0.1 derives a1/a2/d1
- 0.3.2.1 derives a1/a3/b1/c1

An example for TJFast algorithm (Cont.)
Both a1 and a3 may contribute to query answers. (Why not a2?)
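The derivations in this example can be reproduced with a small sketch. It is not the streaming TJFast algorithm: it reads both streams in full and uses plain loops, and it assumes that the first label component denotes the root a element. The function names are hypothetical. What it does show is that the leaf labels alone are enough to find the candidates for A and the matches of both paths.

```python
# Remainder tables for the DTD a -> a*, d*, b*; b -> d*, c*; d -> c*.
CHILD_TAGS = {"a": ["a", "d", "b"], "b": ["d", "c"], "d": ["c"]}

T_D = [(0, 0, 1), (0, 3, 1), (0, 5, 0)]   # labels of d elements
T_C = [(0, 3, 2, 1), (0, 5, 0, 0)]        # labels of c elements


def decode(label):
    """Return [(prefix_label, tag), ...] from the root down to the element.

    Assumption: the first component is the root, which is an `a` element.
    """
    steps = [(label[:1], "a")]
    for i in range(1, len(label)):
        tags = CHILD_TAGS[steps[-1][1]]
        steps.append((label[:i + 1], tags[label[i] % len(tags)]))
    return steps


def match_a_b_c(c_label):
    """Matches of A/B//C for one c label: an `a` whose child `b` is above c."""
    steps = decode(c_label)
    return [(steps[i][0], steps[i + 1][0], c_label)
            for i in range(len(steps) - 2)
            if steps[i][1] == "a" and steps[i + 1][1] == "b"]


def match_a_d(d_label, candidates):
    """Matches of A//D for one d label, restricted to candidate A elements."""
    return [(lab, d_label) for lab, tag in decode(d_label)[:-1]
            if tag == "a" and lab in candidates]


paths_c = [m for lab in T_C for m in match_a_b_c(lab)]
set_a = {a for a, _, _ in paths_c}            # candidates for branching node A
paths_d = [m for lab in T_D for m in match_a_d(lab, set_a)]
final = sorted((a, d, b, c)
               for a, d in paths_d
               for a2, b, c in paths_c if a == a2)
```

In this sketch `set_a` contains the labels 0 and 0.3 (a1 and a3). The label 0.0 (a2) never enters the set because no c label has a path on which a2 is followed by a b child.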
An example for TJFast algorithm (Cont.)
Same document, query, and streams (TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0).
We insert a1 and a3 into the set for A, which becomes {a1, a3}.
Output path solutions:
- A//D: (a1, d1)
- A/B//C: (a3, b1, c1)

An example for TJFast algorithm (Cont.)
Move the cursor of stream TD from d1 to d2.
Output path solutions:
- A//D: (a1, d1), (a1, d2), (a3, d2)
- A/B//C: (a3, b1, c1)

An example for TJFast algorithm (Cont.)
Move the cursor of stream TD from d2 to d3.
Output path solutions:
- A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
- A/B//C: (a3, b1, c1)

An example for TJFast algorithm (Cont.)
Move the cursor of stream TC from c1 to c2.
Output path solutions:
- A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
- A/B//C: (a3, b1, c1), (a1, b2, c2)

Sort and merge-join in TJFast
[Figure: the document tree (a1, a2, a3, b1, b2, c1, c2, d1, d2, d3) and the query twig (A, B, C, D)]
Phase 1. Intermediate paths
- A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
- A/B//C: <a1, b2, c2>, <a3, b1, c1>
The two lists are joined on A.
Phase 2.
Final solutions <A, D, B, C>: <a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>

TJFast+L
Applying the extended Dewey labeling scheme to the tag+level streaming scheme, we propose the TJFast+L algorithm as an extension of TJFast.
Two benefits of TJFast+L over TJFast:
- it reduces I/O cost by reading fewer elements
- it enlarges the optimal query classes

Optimal query classes
- Optimal class of TJFast: only A-D in branching edges
- Optimal class of TJFast+L: also covers queries with only P-C in all edges
[Figure: two example twigs over the nodes A, B, C, D]

XML twig query processing
- Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542
- Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204
- Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189
- Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119
- Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309
- Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
- Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
- Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298
- Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466
- …

Outline
- XML data management
  - XML twig query processing
  - XML keyword search
  - Graphical and interactive XML query processing

Background: XQuery vs.
keyword search
Query intent: papers by "Mike"
XQuery (a complicated query):
  for $a in doc("bib.xml")//author, $n in $a/name
  where $n = "Mike"
  return $a//inproceedings
Keyword search: Mike, inproceedings
The proposed keyword search returns the set of smallest trees containing all keywords.
[Figure: a sample bib.xml tree with author elements (name, hobby, publications) whose publications contain inproceedings and article entries with title and year]

XML keyword search
- Search intention identification
- Query result retrieval
- Result ranking
- Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
- Detailed paper: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528 (one of the best papers, invited to the TKDE journal)

XML keyword search
- Inspired by IR-style keyword search on the web
- Enables users to access information in XML databases
- XML data is modeled as a rooted, labeled tree
- Recent research efforts: efficiency and effectiveness

Effectiveness
- Capture the user's search intention
  - Identify the target that the user intends to search for
  - Infer the predicate constraint that the user intends to search via
- Result ranking
  - Rank the query results according to their objective relevance to the user's search intention

State of the Art
Search semantics design
- LCA (Lowest Common Ancestor): node v is an LCA of keyword set K = {w1, w2, …, wk} if the sub-tree rooted at v contains at least one occurrence of all keywords in K, after excluding the sub-elements that already contain all keywords in K
- SLCA (Smallest LCA): node v is an SLCA of keyword set K = {w1, w2, …, wk} if (1) v is an LCA of K, and (2) no proper descendant of v is an LCA of K
- XSeek: infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes

State of the Art
(cont.)
Efficient result retrieval
- Designed based on a certain search semantics
- XKSearch, Multiway SLCA, etc.
Result ranking
- XRANK, XKSearch, EASE
- They only consider:
  - structural compactness of matching results
  - keyword proximity
  - similarity at node level

Problems Unaddressed
- The user's search intention is not addressed adequately!
- Meaningfulness of query results: SLCA is less meaningful in many cases
- Keyword ambiguity problems
  1. A keyword can appear both as an XML node type and as the text value of some other nodes
  2. A keyword can appear in the text values of different XML node types and carry different meanings
- Neither SLCA nor XSeek can address keyword ambiguity well

Problems: Keyword Ambiguity
Q = "customer, interest, art"
- Ambiguity 1: customer, interest
- Ambiguity 2: art
Intention: find customers whose interest is art.
Less relevant or irrelevant results are returned as well: C1, C3, and B1's title.
[Figure: storeDB XML tree. customers contains customer C1 (name "Mary Smith", street "Art Street", interest "fashion"), C2 (name "John Martin", interest "street art"), C3 (name "Art Smith", interest "rock music"), and C4 (name "Rock Davis", interest "art"); books contains book B1 (title "Art of Customer Interest Care") and B2, each with authors and a publisher]

Problems: Keyword Ambiguity (cont.)
Q = "customer, interest, art"
- "art" can be the value of an interest node (C2, C4), a name node (C3), the street node of a customer (C1), or the title node of a book (B1)
- "customer" can be the tag name of a customer node, or (part of) the value of the title of B1
- How to rank C1 to C4 and B1?
[Figure: the same storeDB XML tree]
Objectives & Challenges
Address the following as a single problem:
- Search intention identification
- Query result retrieval
- Result ranking
- Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
Challenges
I. How to decide which sub-tree(s) with appropriate node types can capture the information the user desires
II. How to return sub-trees of an appropriate size (i.e., containing enough but non-overwhelming information)
III. How to rank those sub-trees by their relevance

Challenges
Difficulty in applying TF*IDF to XML:
- An XML database carries semantic information, while a text database contains pure text. XML TF*IDF must be aware of the underlying semantics.
- All contents of XML data are stored in leaf nodes only
- What is the analogy of a "flat document" in XML?
  - A sub-tree classified according to its prefix path
- The normalization factor is not simply the size of the sub-tree
  - The structure of sub-trees may also affect the ranks

Our Approach
Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, capturing the hierarchical structure of the XML document by analyzing statistics of the underlying XML data.

Major Contributions
1. Identify the user's desired search-for node and search-via node(s) in a heuristic way
- Define XML TF (term frequency) and XML DF (document frequency)
- Confidence formulas for search-for and search-via candidates
2.
Define XML TF*IDF similarity
- Propose 3 guidelines specifically for XML keyword search
- Take keyword ambiguity problems into account
3. Design a keyword search engine, XReal

XML keyword search
- Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934
- Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
- Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)
- Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754
- Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
- Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537
- Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716
- …

Outline
- XML data management
  - XML twig query processing
  - XML keyword search
  - Graphical and interactive XML query processing

Graphical and interactive XML search
- Auto-completion XML search
- Order-sensitive XML twig query
- XML query suggestion
- Demo online: http://datasearch.ruc.edu.cn:8080/LotusX/

Outline
- XML data management
  - XML twig query processing
  - XML keyword search
  - XML keyword refinement
  - Graphical and interactive XML query processing
- Approximate string matching
  - Approximate string search
  - Approximate member extraction

Motivation: Data Cleaning
[Figure: excerpt of a web page with a misspelled name that should clearly be "Niels Bohr"]
Real-world data is dirty:
- Typos
- Inconsistent representations (PO Box vs. P.O.
Box)
Approximately check against a clean dictionary.
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008

Motivation: Record Linkage
We want to link records belonging to the same entity.
Table 1 (columns: Name, Phone, Age; other values shown as …):
- Brad Pitt
- Arnold Schwarzeneger
- George Bush
- Angelina Jolie
- Forrest Whittaker
Table 2 (columns: Name, Hobbies, Address; other values shown as …):
- Brad Pitt
- Forest Whittacker
- George Bush
- Angelina Jolie
- Arnold Schwarzenegger
No exact match! The same entity may have similar representations:
- "Arnold Schwarzeneger" versus "Arnold Schwarzenegger"
- "Forrest Whittaker" versus "Forest Whittacker"

Motivation: Query Relaxation
Actual queries gathered by Google: http://www.google.com/jobs/britney.html
- Errors in queries
- Errors in data
- Bring the query and meaningful results closer together

What is Approximate String Search?
String collection (people): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger, …
Queries against the collection:
- Find all entries similar to "Forrest Whitaker"
- Find all entries similar to "Arnold Schwarzenegger"
- Find all entries similar to "Brittany Spears"
What do we mean by "similar to"?
- Edit distance
- Jaccard similarity
- Cosine similarity
- Dice
- Etc.
The "similar to" predicate can help the applications described above!
How can we support these types of queries efficiently?
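As one concrete reading of "similar to", the sketch below implements edit distance with the standard dynamic program and scans the whole collection. The function names and the threshold parameter k are illustrative assumptions; the linear scan is the naive baseline, not an efficient method.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,         # delete ca
                           cur[j - 1] + 1,      # insert cb
                           prev[j - 1] + cost)) # substitute or match
        prev = cur
    return prev[-1]


COLLECTION = [
    "Brad Pitt", "Forest Whittacker", "George Bush",
    "Angelina Jolie", "Arnold Schwarzeneger",
]


def similar(query, k, collection=COLLECTION):
    """Naive approximate search: every entry within edit distance k."""
    return [s for s in collection if edit_distance(query, s) <= k]


hits = similar("Arnold Schwarzenegger", 1)
```

The scan computes one distance per entry, so its cost grows with the size of the collection; that is the inefficiency the closing question of the slide points at.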
Approximate Query Answering
Main idea: use q-grams as signatures for a string.
Example: sliding a window of length 2 over "irvine" gives the 2-grams {ir, rv, vi, in, ne}.
Intuition: similar strings share a certain number of grams.
An inverted index on grams supports finding all data strings that share enough grams with a query.

Approximate Query Example
Query: "irvine", edit distance 1
2-grams of the query: {ir, rv, vi, in, ne}
Look up the grams in the 2-gram inverted lists (string IDs):
- in: 1, 3, 4, 5, 7, 9
- tf: 5, 9
- vi: 1, 5
- ir: 1, 2, 3, 9
- ef: 3, 9
- rv: 7, 9
- ne: 1, 2, 4, 5, 6
- un: 5, 6, 9
- …
Each edit operation can "destroy" at most q grams, so answers must share at least T = 5 - 1 * 2 = 3 grams with the query.
Candidates = {1, 5, 9}. The candidates may include false positives, so the real similarity still needs to be computed.
T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list merging; T is called the merging threshold.

Outline
- XML data management
  - XML twig query processing
  - XML keyword search
  - XML keyword refinement
  - Graphical and interactive XML query processing
- Approximate string matching
  - Approximate string search
  - Approximate member extraction

Introduction: An Example
- A dictionary of strings we are interested in, e.g., product names, postal addresses, …
- We are going to locate their "approximate occurrences" in documents.
- See the meaning of "approximate occurrence" in the following example:

Problem Definition
Given a dictionary R and a threshold δ, extract all proper substrings m from the input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). Here we call r a piece of evidence for m.
- Similarity() is a function measuring the similarity of two strings
- Strings are viewed as sets of tokens (words)
- An example for Sim(): Jaccard similarity, J(r, m) = wt(r ∩ m) / wt(r ∪ m)

Why pre-pruning is needed
- We need evidence to decide whether a substring m should be extracted
- Simple verification against all dictionary strings may be inefficient
- Pre-pruning followed by post-verification is beneficial
- But should it be running-time-specific or filtering-power-specific?
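The gram-count filter from the "Approximate Query Example" slide is one instance of pre-pruning followed by post-verification, and its pruning half can be sketched as below. The inverted-list contents are an assumed reading of the slide's figure, and the function names are hypothetical. The sketch returns candidates only; verifying them with the real similarity function is the second stage and is omitted because the underlying strings are not given.

```python
from collections import Counter


def qgrams(s, q=2):
    """All overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


# Gram -> sorted string IDs. Contents are an assumed reading of the slide.
INVERTED = {
    "in": [1, 3, 4, 5, 7, 9], "tf": [5, 9], "vi": [1, 5],
    "ir": [1, 2, 3, 9], "ef": [3, 9], "rv": [7, 9],
    "ne": [1, 2, 4, 5, 6], "un": [5, 6, 9],
}


def merging_threshold(query, k, q=2):
    """T = (number of query grams) - k * q, since one edit operation
    can destroy at most q grams."""
    return len(qgrams(query, q)) - k * q


def t_occurrence(query, k, index=INVERTED, q=2):
    """IDs that appear on at least T of the query's inverted lists.

    Assumes T > 0; otherwise the count filter cannot prune anything.
    """
    t = merging_threshold(query, k, q)
    counts = Counter()
    for gram in qgrams(query, q):
        counts.update(index.get(gram, []))
    return sorted(sid for sid, c in counts.items() if c >= t)


candidates = t_occurrence("irvine", k=1)
```

Counting with a hash map is the simplest way to solve the T-occurrence problem; it touches every element of every looked-up list, which is exactly the filtration time that has to be weighed against the verification time saved.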
Less time or fewer survivors?
The issue of compromise comes again. A balance between the two stages should be reached:
- More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time
- Overall performance = Tf + Tv ?

State-of-the-art techniques: K-signature scheme
- Proposed by Chakrabarti et al. (SIGMOD 2008)
- Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
- Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
- K is a parameter for tuning the filtration power
- Potential evidence loss
  - A counter-example was found for k = 3
  - We tried, and could only prove that it works for k = 1 and k = ∞

State-of-the-art techniques: Inverted Signature-based Hashtable
- Proposed by Chakrabarti et al. (SIGMOD 2008)
- Each dictionary string is encoded into a solid 0-1 matrix
  - A '1' for each occurrence of a <token, sig-token> tuple ('1'-rectangle)
- Bitwise-or all solid matrices to get the matrix of R
- Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R
- Formalized into an NP-complete problem
  - The solution causes too weak filtering power

Our proposed theorem
If Sim(m, r) ≥ δ, what do we have?
  wt(Sig(m) ∩ Sig(r)) ≥ τ(m)
  wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}
- So the threshold does not remain constant
- It involves unknown evidence
Our solution: use inverted lists to count sig-token overlaps. Note that sig-tokens usually have low document frequency (e.g., with IDF as weights).

Our algorithms and evaluations: EvSCAN, filtration by SIL
Signature-based Inverted Lists (SIL)
- Lists are indexed by sig-tokens
- Each sig-token of a string creates a node (containing the string's id) in the corresponding list
E.g., R = {r1 = "canon eos 5d digital camera", r2 = "nikon digital slr camera", r3 = "canon slr camera"}, with wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).
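A minimal sketch of how such lists could be built for this dictionary is given below. It assumes that each string keeps its k top-weighted tokens as sig-tokens with k = 3 and that ties are broken alphabetically; neither choice is stated on the slides, and the function names are hypothetical. The lookup only counts shared sig-tokens and does not apply the threshold τ from the theorem.

```python
from collections import defaultdict

WEIGHTS = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2,
           "slr": 2, "eos": 7, "5d": 9}
R = {
    1: "canon eos 5d digital camera",
    2: "nikon digital slr camera",
    3: "canon slr camera",
}


def signatures(s, k, weights=WEIGHTS):
    """The k top-weighted tokens of s (ties broken alphabetically,
    which is an assumption of this sketch)."""
    tokens = set(s.split())
    return sorted(tokens, key=lambda t: (-weights[t], t))[:k]


def build_sil(dictionary, k):
    """Signature-based inverted lists: sig-token -> ids of the strings
    that have this token among their signatures."""
    sil = defaultdict(list)
    for rid in sorted(dictionary):
        for tok in signatures(dictionary[rid], k):
            sil[tok].append(rid)
    return dict(sil)


def count_overlaps(m, sil):
    """For a candidate substring m, count per dictionary string how many
    of its sig-tokens occur in m."""
    counts = defaultdict(int)
    for tok in set(m.split()):
        for rid in sil.get(tok, []):
            counts[rid] += 1
    return dict(counts)


sil = build_sil(R, k=3)
overlaps = count_overlaps("canon slr camera bag", sil)
```

Only strings that appear on some looked-up list are ever touched, so a substring with no sig-token in common with the dictionary is discarded without any verification.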
[Figure: the signature-based inverted lists for R, keyed by sig-token and weight (5d 9.0, canon 2.0, camera 1.0, eos 7.0, nikon 2.0, slr 2.0); each list holds the ids of the dictionary strings that have this sig-token]

Approximate string matching
- Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324
- Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615
- Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266
- Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739
- …

Thank you
Q&A