my presentation

Presentation in TeleCom
ParisTech
XML data management and
approximate string matching
Jiaheng Lu
Key Lab of Data Engineering and Knowledge Engineering
Renmin University of China
November 22 2010
Research experience
 Associate Professor: Renmin University of
China
 XML data management, Cloud data
management, Approximate search
 Post-doc: University of California, Irvine
 Data integration, Approximate string match
 PhD: National University of Singapore
 XML data management
Outline
 XML data management
 XML twig query processing
 XML keyword search
 Graphical and interactive XML query processing
 Approximate string matching
 Approximate string search
 Approximate member extraction
XML twig query processing
 XPath: Section[Title]/Paragraph//Figure
 Twig pattern
Section
Title
Paragraph
Figure
XML twig query processing (Cont.)
 Problem Statement
Given a query twig pattern Q, and an XML database D, we
need to compute ALL the answers to Q in D.
 E.g. Consider Query and Document:
[Figure: document tree with elements s1, s2, t1, t2, p1, f1; query twig with nodes Section, title, figure]
Query solutions:
(s1, t1, f1)
(s2, t2, f1)
(s1, t2, f1)
Previous work: TwigStack
 TwigStack [1] is a holistic algorithm for XML
twig matching on containment labeling
scheme.
 Two steps in TwigStack :
 (1) intermediate path solutions are output to
match each query root-to-leaf path; and
 (2) these intermediate path solutions are merged
to get the final results.
[1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal
xml pattern matching. In Proceedings of ACM SIGMOD, 2002.
Running example: TwigStack algorithm
[Figure: query twig and state of stacks]
Query: s with branches s//t and s//f
Data streams:
s: (1,12,1) (4,11,2)
t: (2,3,2) (5,6,3)
f: (8,9,4)
Output path intermediate solutions:
s//t: (1,12,1) (2,3,2); (1,12,1) (5,6,3); (4,11,2) (5,6,3)
s//f: (1,12,1) (8,9,4); (4,11,2) (8,9,4)
Final results:
(1,12,1) (2,3,2) (8,9,4)
(1,12,1) (5,6,3) (8,9,4)
(4,11,2) (5,6,3) (8,9,4)
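The running example can be reproduced with a naive sketch over the containment labels (start, end, level). This is not TwigStack itself (there are no stacks and no holistic pruning); it only enumerates the path solutions and merges them on s. The function names are hypothetical.

```python
from itertools import product

# Containment labels (start, end, level) taken from the slide's data streams.
streams = {
    "s": [(1, 12, 1), (4, 11, 2)],
    "t": [(2, 3, 2), (5, 6, 3)],
    "f": [(8, 9, 4)],
}

def is_ancestor(a, d):
    """a is an ancestor of d if a's interval strictly contains d's interval."""
    return a[0] < d[0] and d[1] < a[1]

def is_parent(a, d):
    """Parent-child additionally requires the levels to differ by exactly one."""
    return is_ancestor(a, d) and d[2] == a[2] + 1

def path_solutions(anc_tag, desc_tag):
    """All (ancestor, descendant) pairs for one root-to-leaf query path."""
    return [(a, d)
            for a, d in product(streams[anc_tag], streams[desc_tag])
            if is_ancestor(a, d)]

st = path_solutions("s", "t")   # step 1: solutions for s//t
sf = path_solutions("s", "f")   # step 1: solutions for s//f
# Step 2: merge the path solutions that agree on the branching node s.
final = [(s, t, f) for (s, t) in st for (s2, f) in sf if s == s2]

for row in final:
    print(row)
```

With parent-child edges, `is_parent` would replace `is_ancestor`; path solutions that pass the ancestor test but fail to join are the useless intermediate results discussed on the next slide.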
Limitations of TwigStack
 (1) TwigStack may output many useless intermediate
results for queries with parent-child relationship
 (2) TwigStack cannot process XML twig queries with
ordered predicates, like “Preceding”, “Following” in
XPath
 (3) TwigStack cannot answer queries with wildcards
in branching nodes.
E.g.
*
B
C
The parent of B should be an
ancestor of C
XML twig query processing (Cont.)
 Several efficient pattern matching
algorithms
TJFast (VLDB 05)(citation: 173)
iTwigJoin (SIGMOD 05)
TwigStackList (CIKM 04)
TreeMatch (TKDE 10)
Motivation: new labeling scheme
 TwigStackList and iTwigJoin are both based on the
containment labeling scheme
Why not try Dewey
labeling scheme for XML
twig pattern query ?
Oh, it is really a novel
idea!
Original Dewey Labeling Scheme
 In Dewey labeling scheme, each element is
represented by an integer sequence:
 (i) the root is labeled by the empty string ε
 (ii) for a non-root element u, label(u) = label(s).x, where
u is the x-th child of s.
 For example:
[Figure: document tree rooted at s1 with elements t1, s2, f2, t2, f1 and Dewey labels 1, 2, 3, 2.1, 2.2]
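A minimal sketch of the labeling rule above, assuming the tree is given as nested (tag, children) tuples. The sample tree is hypothetical, chosen only to be consistent with the labels 1, 2, 3, 2.1 and 2.2 in the example, and dewey_labels is an illustrative name.

```python
def dewey_labels(tree, label=()):
    """Yield (label, tag) pairs in document order.

    The root gets the empty label; the x-th child of a node s gets label(s).x.
    """
    tag, children = tree
    yield label, tag
    for x, child in enumerate(children, start=1):
        yield from dewey_labels(child, label + (x,))

# Hypothetical tree: s -> t, s, f ; inner s -> t, f
tree = ("s", [("t", []), ("s", [("t", []), ("f", [])]), ("f", [])])

for label, tag in dewey_labels(tree):
    print(".".join(map(str, label)) or "ε", tag)
```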
Main problem of the original Dewey
 If we use the original Dewey labeling scheme to
answer the twig query, we need to read labels for all
query nodes. Thus, this is not a better solution than
previous algorithms.
Extend the original Dewey
labeling scheme so that given the
label of any element e, we can
know the path of e from this label
alone
Modular function
 We need to know some schema information: DTD (Document
Type Definitions ) or XML schema
 Given DTD information: book → author, title, chapter*
 Our solution: using modular function, we create a match between
an element tag and an integer number.
 We define Xauthor mod 3 = 0, Xtitle mod 3 = 1, Xchapter mod 3 = 2,
where Xt is the last integer of the label of an element with tag t
(3 is the number of distinct tags under book).
[Figure: book with children author (label 0), title (label 1), chapter (label 2), chapter (label 5)]
Why is the second chapter labeled 5 rather than 3, as in the original Dewey?
Derive element tag
 From a label , we can derive its tag name.
 book → author, title, chapter*
 Recall that we define: Xauthor mod 3 = 0 Xtitle mod
3 = 1, Xchapter mod 3 = 2.
[Figure: book with children labeled 0, 1, 2, 5; the tags to derive are author, title, chapter, chapter]
More examples for assigning labels
 Let us consider a more complicated DTD
 a → (b | c )*, d?, c+
 We define: Xb mod 3 = 0, Xc mod 3 = 1, Xd mod 3 = 2
(Why do we use mod 3 instead of 4?)
[Figure: element a with children b (label 0), d (label 2), c (label 4), c (label 7)]
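The modular assignment can be sketched as follows, assuming each child receives the smallest integer larger than its preceding sibling's that has the right remainder. This rule reproduces the labels on the slides, but the function names and the exact rule are an illustration, not necessarily the authors' formulation.

```python
def assign_child_labels(child_tags, tag_order):
    """Assign the last label component to each child, left to right.

    tag_order lists the distinct child tags allowed by the DTD; a child with
    tag tag_order[k] gets the smallest x greater than its left sibling's
    component such that x mod len(tag_order) == k.
    """
    n = len(tag_order)
    index = {tag: k for k, tag in enumerate(tag_order)}
    labels, prev = [], -1
    for tag in child_tags:
        k = index[tag]
        x = prev + 1
        x += (k - x) % n          # move forward to the next value with x % n == k
        labels.append(x)
        prev = x
    return labels

def derive_tag(x, tag_order):
    """Recover the tag name from the last label component alone."""
    return tag_order[x % len(tag_order)]

book_tags = ["author", "title", "chapter"]
a_tags = ["b", "c", "d"]
print(assign_child_labels(["author", "title", "chapter", "chapter"], book_tags))
print(assign_child_labels(["b", "d", "c", "c"], a_tags))
print(derive_tag(5, book_tags))
```

The modulus is the number of distinct child tags (3 for both DTDs here), not the number of positions in the content model, which is why a → (b | c)*, d?, c+ uses mod 3.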
Derive the path from a label
 By following a finite state transducer (FST), we may recursively derive
the whole path from any extended Dewey label.
 For example:
DTD:
book → author, title, chapter*
chapter → (paragraph | section)*
section → (paragraph | section)*
FST:
book: Mod 3=0 → author, Mod 3=1 → title, Mod 3=2 → chapter
chapter: Mod 2=0 → paragraph, Mod 2=1 → section
section: Mod 2=0 → paragraph, Mod 2=1 → section
[Figure: example document tree with book, author, title, chapter, section and paragraph elements]
Question: Given a label 5.1.0, what is the
corresponding path ?
Following the red path in the FST, the label 5.1.0 denotes:
book/chapter/section/paragraph
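The transducer walk for this example can be sketched as a table lookup, assuming the child tags of each element are listed in the order that fixes their remainders (paragraph = 0 and section = 1 under chapter and section). The dictionary and function names are hypothetical.

```python
# Child tags per element, in the order that defines the "mod" classes.
DTD_CHILD_TAGS = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}

def derive_path(label, root="book"):
    """Decode an extended Dewey label into its root-to-element tag path."""
    path = [root]
    for x in label:
        tags = DTD_CHILD_TAGS[path[-1]]      # state = tag of the current element
        path.append(tags[x % len(tags)])     # transition chosen by x mod n
    return "/".join(path)

print(derive_path((5, 1, 0)))
```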
Two properties of extended Dewey
 Find Ancestor Label
 From a label of any element, we can derive the labels
of all its ancestors.
 Find Ancestor Name
 From a label of any element, we can derive the tag
names of all its ancestors.
 Two properties enable us to design a new and efficient
algorithm for XML twig pattern matching.
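The first property follows from the prefix structure of Dewey-style labels. A small sketch, assuming labels are tuples of integers and that the document root carries a one-component label (as in the TJFast example, where a1 is labeled 0); the function names are illustrative.

```python
def ancestor_labels(label):
    """All proper prefixes of a label, from the root down to the parent."""
    return [label[:i] for i in range(1, len(label))]

def is_ancestor(a, d):
    """a is an ancestor of d exactly when a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

print(ancestor_labels((0, 3, 2, 1)))
```

The second property (ancestor names) is obtained by decoding each prefix with the finite state transducer.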
A new algorithm: TJFast
 For each node n in the query, there exists a corresponding
input stream Tn.
 Tn contains the extended Dewey labels of elements of tag
n. Those labels are arranged by the document order.
 For each branching node b of twig pattern, there is a
corresponding set Sb, which contains elements that may
be involved in query answers. (How does this differ from
TwigStackList?)
 At any point during the computation, the size of set Sb is
bounded by the depth of the XML document.
An example for TJFast algorithm
[Figure: document tree and query twig]
Document (element: extended Dewey label): a1: 0, a2: 0.0, d1: 0.0.1, a3: 0.3, d2: 0.3.1, b1: 0.3.2, c1: 0.3.2.1, b2: 0.5, d3: 0.5.0, c2: 0.5.0.0
Query: A with branches A//D and A/B//C; { … } is the set for the branching node A
DTD:
a -> a*, d*, b*
b -> d*, c*
d -> c*
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
Why are there only two streams?
An example for TJFast algorithm
[Figure: the same document tree and query twig]
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
By the finite state transducer of the extended Dewey labeling scheme, we derive:
0.0.1 → a1/a2/d1
0.3.2.1 → a1/a3/b1/c1
An example for TJFast algorithm
[Figure: the same document tree and query twig]
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
Both a1 and a3 may be involved in query answers. (Why not a2?)
An example for TJFast algorithm
[Figure: the same document tree and query twig]
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
Then we insert a1 and a3 into the set: A {a1,a3}
Output path solutions:
A//D: (a1, d1)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm
[Figure: the same document tree and query twig]
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
A {a1,a3}
Move the cursor of TD from d1 to d2
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm
[Figure: the same document tree and query twig]
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
A {a1,a3}
Move the cursor of stream TD from d2 to d3
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm
[Figure: the same document tree and query twig]
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
A {a1,a3}
Move the cursor of stream TC from c1 to c2
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1), (a1, b2, c2)
Sort and merge-join in TJFast
[Figure: document tree and query twig]
Phase 1. Intermediate paths
A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
A/B//C: <a1, b2, c2>, <a3, b1, c1>
Join
Phase 2. Final solutions
<A, D, B, C>:
<a1,d1,b2,c2>, <a1,d2,b2,c2>,
<a1,d3,b2,c2>, <a3,d2,b1,c1>
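The second phase can be illustrated with the path solutions from the example. For brevity this sketch groups one side by the branching node A in a dictionary (a hash join) instead of the sort and merge-join named on the slide; the result is the same set of final solutions. The function name is hypothetical.

```python
from collections import defaultdict

# Phase 1 output, copied from the example.
ad_paths = [("a1", "d1"), ("a1", "d2"), ("a1", "d3"), ("a3", "d2")]
abc_paths = [("a1", "b2", "c2"), ("a3", "b1", "c1")]

def join_on_branching_node(ad, abc):
    """Combine A//D and A/B//C path solutions that share the same A element."""
    by_a = defaultdict(list)
    for a, b, c in abc:
        by_a[a].append((b, c))
    # Sort the A//D side so the output order is deterministic.
    return [(a, d, b, c) for a, d in sorted(ad) for b, c in by_a.get(a, [])]

final = join_on_branching_node(ad_paths, abc_paths)
for row in final:
    print(row)
```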
TJFast+L
 Applying the extended Dewey labeling scheme to the
tag+level streaming scheme, we propose the TJFast+L
algorithm, an extension of TJFast
 Two benefits of TJFast+L over TJFast
 reduce I/O cost by reading fewer elements
 enlarge the optimal query classes
Optimal query classes
[Figure: optimal query classes of TJFast and TJFast+L, with regions labeled "Only A-D in branching edges" and "Only P-C in all edges", illustrated by two example twigs over nodes A, B, C, D]
XML twig query processing
Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with
parent child edges: a look-ahead approach. CIKM 2004:533-542
Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To
Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB
2005:193-204
Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb
2004:180-189
Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig
pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119
Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of
Ordered XML Twig Pattern. DEXA 2005:300-309
Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing
- (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm
for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic
Approach to Optimize XML Query Processing. DASFAA 2008:282-298
Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern
Matching using Structural Indexing Techniques. SIGMOD 2005:455-466
……
Outline
 XML data management
 XML twig query processing
 XML keyword search
 Graphical and interactive XML query processing
Background: XQuery vs. keyword search
Query papers by “Mike”
XQuery (complicated):
for $a in doc("bib.xml")//author,
    $n in $a/name
where $n = "Mike"
return $a//inproceedings
Keyword search:
Mike, inproceedings
 The proposed keyword search returns the set of
smallest trees containing all keywords.
[Figure: example bib XML tree with author nodes (name, hobby, publications) and their inproceedings and article entries (title, year); keywords: Mike, hobby]
XML keyword search
– Search intention identification
– Query result retrieval
– Result ranking
– Extend original TF*IDF from text database to XML database,
while capturing the hierarchical structure of XML data
– Detailed paper: Effective XML Keyword Search with
Relevance Oriented Ranking. ICDE 2009:517-528
(one of the best papers, invited to the TKDE journal)
XML keyword search
 XML Keyword search
 Inspired by IR style keyword search on the web
 Enables user to access information in XML
database
 XML data modeled as a rooted, labeled tree
 Recent research efforts
 Efficiency
 Effectiveness
Effectiveness
Capture user’s search intention
Identify the target that user intends to search for
Infer the predicate constraint that user intends to search via
Result ranking
Rank the query results according to their objective
relevance to user search intention
State of the Art
 Search semantics design
 LCA (Lowest Common Ancestor)
 Node v is a LCA of keyword set K={w1, w2,…,wk} if the sub-tree
rooted at v contains at least one occurrence of all keywords in K,
after excluding the sub-elements that already contain all
keywords in K
 SLCA (Smallest LCA)
 Node v is a SLCA of keyword set K={w1, w2,…,wk} if
 (1) v is a LCA of K
 (2) no proper descendant of v is LCA of K
 XSeek
 Infers the search intention based on the concept of objects and an
analysis of the matching between keyword and data node
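The SLCA definition can be sketched by brute force over Dewey labels: a node contains a keyword if some match of that keyword lies in its subtree, and an SLCA is a node containing all keywords with no proper descendant that also does. This is a naive illustration, not one of the efficient algorithms cited in the deck; the input format and the sample labels are assumptions.

```python
def slca(matches):
    """matches maps each keyword to the Dewey labels (tuples) of its match nodes."""
    def ancestors_or_self(label):
        return {label[:i] for i in range(len(label) + 1)}

    covering = None                      # nodes whose subtree contains every keyword
    for labels in matches.values():
        nodes = set()
        for label in labels:
            nodes |= ancestors_or_self(label)
        covering = nodes if covering is None else covering & nodes
    covering = covering or set()

    # Keep only nodes with no proper descendant that also covers all keywords.
    return sorted(v for v in covering
                  if not any(u != v and u[:len(v)] == v for u in covering))

# Hypothetical matches: "mike" under author (0,0); "hobby" under authors (0,0) and (0,1).
example = {"mike": [(0, 0, 0)], "hobby": [(0, 0, 1), (0, 1, 1)]}
print(slca(example))
```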
State of the Art (cont)
 Efficient result retrieval
 Designed based on a certain search semantics
 XKSearch, Multiway SLCA etc.
 Result ranking
 XRANK, XKSearch, EASE
 They only consider
 Structural compactness of matching results
 Keyword proximity
 Similarity at node level
Problems Unaddressed
User search intention is not adequately addressed!
 Meaningfulness of query result
SLCA is less meaningful in many cases
 Keyword Ambiguity Problems
1. A keyword can appear both as an xml node type and as the
text value of some other nodes
2. A keyword can appear in the text values of different xml
node types and carry different meanings
Neither SLCA nor XSeek addresses keyword ambiguity well
Problems: Keyword Ambiguity
Q = “customer, interest, art”
 Ambiguity 1: customer, interest; Ambiguity 2: art
 Intention: find customers whose interest is art
 Less relevant or irrelevant results are also returned: C1, C3, B1’s title
[Figure: storeDB XML tree with customers (C1 to C4, each with ID, name, contact or address, interests, purchases) and books (B1, B2, each with ID, title, authors, publisher)]
Problems: Keyword Ambiguity (cont)
Q = “customer, interest, art”
 “art” can be the value of an interest node (C2, C4), a name node (C3), the street
node of a customer (C1), or the title node of a book (B1)
 “customer” can be the tag name of a customer node, or (part of) the value of the title of B1
- How to rank C1 to C4 and B1?
[Figure: the same storeDB XML tree of customers and books]
Objectives & Challenges
• Address the following as a single problem
– Search intention identification
– Query result retrieval
– Result ranking
– Extend original TF*IDF from text database to XML database,
while capturing the hierarchical structure of XML data
Challenges
I. How to decide which sub-tree(s) with appropriate node types capture
the information the user wants
II. How to return sub-trees of an appropriate size (i.e. containing enough but non-overwhelming information)
III. How to rank those sub-trees by their relevance
Challenges
Difficulty in applying TF*IDF to XML
XML DB carries semantic information while text DB contains
pure text information. XML TF*IDF must be aware of the
underlying semantics.
All contents of XML data are stored in leaf nodes only
What is the analogy of a “flat document” in XML?
o Sub-trees are classified according to their prefix path
The normalization factor is not simply the size of the sub-tree
o The structure of sub-trees may also affect the ranks
Our Approach
 Extend IR-style keyword search techniques (like TF*IDF) from text
database to XML database, in order to capture the hierarchical structure
of xml document
 by analyzing the knowledge of statistics of underlying XML data
 Major Contributions
1. Identify user’s desired search-for node and search-via node(s) in a heuristic
way
 Define XML TF (term frequency) and XML DF (document frequency)
 Confidence Formulas for search for/via candidates
2. Define XML TF*IDF Similarity
 Propose 3 guidelines specifically for xml keyword search
 Take keyword ambiguity problems into account
3. Design a Keyword Search Engine XReal
XML keyword search
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML
keyword searching. CIKM 2010:1933-1934
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An
Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective
XML Keyword Search. TKDE, 22(8):1077-1092 (2010)
Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective
Ranked XML Keyword Search with Meaningful Result Display. DASFAA
2009:750-754
Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML
Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for
Effective Keyword Search in XML Documents. DASFAA 2008:529-537
Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in
XML Documents Based on MIU. DASFAA 2006:702-716
……
Outline
 XML data management
 XML twig query processing
 XML keyword search
 Graphical and interactive XML query processing
Graphical and interactive XML search
 Auto-completion XML search
 Order-sensitive XML twig query
 XML query suggestion
 Demo online:
http://datasearch.ruc.edu.cn:8080/LotusX/
Outline
 XML data management
 XML twig query processing
 XML keyword search
 XML Keyword refinement
 Graphical and interactive XML query processing
 Approximate string matching
 Approximate string search
 Approximate member extraction
Motivation: Data Cleaning
Should clearly be “Niels Bohr”

Real-world data is dirty

Typos

Inconsistent representations

(PO Box vs. P.O. Box)

Approximately check against
clean dictionary
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Motivation: Record Linkage
We want to link records belonging to the same entity
Table 1:
Phone   Age   Name
…       …     Brad Pitt
…       …     Arnold Schwarzeneger
…       …     George Bush
…       …     Angelina Jolie
…       …     Forrest Whittaker
Table 2:
Name                    Hobbies   Address
Brad Pitt               …         …
Forest Whittacker       …         …
George Bush             …         …
Angelina Jolie          …         …
Arnold Schwarzenegger   …         …
No exact match!
The same entity may have similar representations:
Arnold Schwarzeneger versus Arnold Schwarzenegger
Forrest Whittaker versus Forest Whittacker
Motivation: Query Relaxation
Actual
queries
gathered by
Google
http://www.google.com/jobs/britney.html

Errors in queries

Errors in data

Bring query and meaningful
results closer together
What is Approximate String Search?
Queries against collection:
Find all entries similar to “Forrest Whitaker”
Find all entries similar to “Arnold Schwarzenegger”
Find all entries similar to “Brittany Spears”
What do we mean by similar to?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.
String Collection: (People)
Brad Pitt
Forest Whittacker
George Bush
Angelina Jolie
Arnold Schwarzeneger
…
…
…
The similar to predicate can help our described applications!
How can we support these types of queries efficiently?
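Two of the similarity measures listed above can be sketched directly. Edit distance is computed with the standard dynamic program; Jaccard similarity is shown on plain sets (for example token sets or gram sets). The function names are illustrative.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]

def jaccard(x, y):
    """|x ∩ y| / |x ∪ y| on sets; two empty sets count as identical."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x | y) else 1.0

print(edit_distance("Arnold Schwarzeneger", "Arnold Schwarzenegger"))
print(edit_distance("Forrest Whittaker", "Forest Whittacker"))
```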
Approximate Query Answering
Main Idea: Use q-grams as signatures for a string
irvine
Sliding Window
2-grams {ir, rv, vi, in, ne}
Intuition: Similar strings share a certain number of grams
Inverted index on grams supports finding all data strings sharing enough
grams with a query
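A minimal sketch of gram extraction and a gram-based inverted index, assuming string IDs are simply positions in the input list and that no padding characters are added at the string boundaries. The function names and the second sample string are hypothetical.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Slide a window of width q over s and collect the substrings."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Map each gram to the sorted list of IDs of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):   # each string appears once per list
            index[gram].append(sid)
    return index

print(qgrams("irvine"))
index = build_index(["irvine", "irving"])
print(index["ir"], index["ne"])
```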
Approximate Query Example
Query: “irvine”, Edit Distance 1
2-grams {ir, rv, vi, in, ne}
Lookup grams in the inverted lists (gram: stringIDs), as read from the figure:
…
in: 1, 3, 4, 5, 7, 9
tf: 5, 9
vi: 1, 5
ir: 1, 2, 3, 9
ef: 3, 9
rv: 7, 9
ne: 1, 2, 4, 5, 6
un: 5, 6, 9
…
Each edit operation can “destroy” at most q grams
Answers must share at least T = 5 - 1 * 2 = 3 grams
Candidates = {1, 5, 9}
May have false positives; need to compute real similarity
T-Occurrence problem: Find elements occurring at least T=3 times among
inverted lists. This is called list-merging. T is called merging-threshold.
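The T-occurrence step can be sketched by counting how often each string ID appears across the query's inverted lists. The lists below are transcribed from the slide's figure in reading order, which is an assumption about the flattened layout; the counting approach is the simplest possible one, not one of the optimized merging algorithms from the cited papers.

```python
from collections import Counter

def merge_threshold(query, q, k):
    """Lower bound on shared grams: (number of query grams) - k * q."""
    return (len(query) - q + 1) - k * q

def t_occurrence(lists, T):
    """IDs that occur on at least T of the given inverted lists."""
    counts = Counter(sid for lst in lists for sid in lst)
    return sorted(sid for sid, c in counts.items() if c >= T)

# Inverted lists for the five grams of "irvine" (assumed transcription).
lists = {
    "ir": [1, 2, 3, 9],
    "rv": [7, 9],
    "vi": [1, 5],
    "in": [1, 3, 4, 5, 7, 9],
    "ne": [1, 2, 4, 5, 6],
}

T = merge_threshold("irvine", q=2, k=1)
candidates = t_occurrence(lists.values(), T)
print(T, candidates)
```

The candidates still have to be verified with the real similarity function, since sharing T grams is necessary but not sufficient.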
Outline
 XML data management
 XML twig query processing
 XML keyword search
 XML Keyword refinement
 Graphical and interactive XML query processing
 Approximate string matching
 Approximate string search
 Approximate member extraction
Introduction: An Example
 A dictionary of strings we are interested in
 E.g. product names, postal addresses…
 We are going to locate their “approximate
occurrence” in documents.
 See the meaning of “approximate occurrence” in
the following example:
Problem Definition
 Given a dictionary R and a threshold δ, extract
all proper substrings m from input documents
S such that there exists r ∈ R with Similarity(r,
m) ≥ δ (or Distance(r, m) ≤ k).
 Here we call r a piece of evidence for m.
 Similarity() is a function measuring the similarity
of two strings
 Strings are viewed as sets of tokens (words)
 An example for Sim(): Jaccard similarity:
$J(r, m) = \frac{wt(r \cap m)}{wt(r \cup m)}$
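The problem definition can be made concrete with a brute-force sketch: enumerate every token substring m of the document up to a maximum length and verify it against every dictionary string with weighted Jaccard. This is the naive baseline that pre-pruning is meant to avoid. The default weight of 1.0 for unknown tokens, the max_len bound and the sample document are assumptions; the dictionary and weights are the camera example that appears later in the deck.

```python
def weighted_jaccard(r, m, wt):
    """wt(r ∩ m) / wt(r ∪ m) on token sets; unknown tokens weigh 1.0 (assumption)."""
    r, m = set(r), set(m)
    union = sum(wt.get(t, 1.0) for t in r | m)
    inter = sum(wt.get(t, 1.0) for t in r & m)
    return inter / union if union else 0.0

def extract(dictionary, doc_tokens, wt, delta, max_len):
    """Return (substring, evidence) pairs with similarity >= delta."""
    found = []
    for i in range(len(doc_tokens)):
        for j in range(i + 1, min(i + max_len, len(doc_tokens)) + 1):
            m = doc_tokens[i:j]
            for r in dictionary:
                if weighted_jaccard(r.split(), m, wt) >= delta:
                    found.append((" ".join(m), r))
    return found

wt = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2, "slr": 2, "eos": 7, "5d": 9}
R = ["canon eos 5d digital camera", "nikon digital slr camera", "canon slr camera"]
doc = "buy canon eos 5d camera now".split()
result = extract(R, doc, wt, delta=0.93, max_len=5)
print(result)
```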
Why pre-pruning is needed
 We need evidence to decide whether a
substring m should be extracted
 Simple verification on all dictionary strings may be
inefficient
 Pre-pruning and post-verifying is beneficial
 But should it be running-time-specific or filtering-power-specific?
 Less time or fewer survivors?
The issue of compromise comes again
 Balance between the two stages should be
reached:
More (less) filtration time → Strong (weak) filtration power → Fewer (more) candidates → Less (more) verification time
Overall performance = Tf + Tv ?
State-of-the-art techniques: K-signature scheme
 K-signature scheme
 Proposed by Chakrabarti et al. (SIGMOD 2008)
 Choose several top-weighted tokens in a string as signatures to
represent it: s => Sig(s)
 Observation: if r cannot match m, r is likely to have insufficient
signature overlapping with m
 K is a parameter for filtration power tuning
 Potential evidence loss
 A counter-example found when k=3
 We tried and only proved that it works for k=1 and k=∞
State-of-the-art techniques: Inverted Signature-based Hashtable
 Proposed by Chakrabarti et al. (SIGMOD 2008)
 Each dictionary string encoded into a solid 0-1 matrix
 A ‘1’ for each occurrence of a <token,sig-token> tuple
(‘1’- rectangle)
 Bitwise-or all solid matrices to get the matrix of R
 Observation: if m is an approximate member of R, the
matrix of m must have enough intersections with that of R.
 Formalized into an NP-complete problem
 The solution yields filtering power that is too weak
Our proposed theorem
 If Sim(m,r) ≥δ, what do we have ?
wt(Sig(m)∩Sig(r)) ≥ τ(m)
wt(Sig(m)∩Sig(r)) ≥ min{τ(m),τ(r) }
 So the threshold does not remain constant
 involves unknown evidence
 Our solution: Use inverted lists to count sig-token overlaps.
 Note that sig-tokens usually have low document
frequency (e.g. IDF as weights)
Our algorithms and evaluations: EvSCAN, filtration by SIL
 Signature-based Inverted Lists (SIL)
 Lists indexed by sig-tokens
 Each sig-token of a string creates a node (containing the string’s id)
in the corresponding list.
 E.g. R = { r1 = “canon eos 5d digital camera", r2 =“nikon digital slr
camera”, r3=“canon slr camera”}.
 wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7 ,9).
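[Figure: signature-based inverted lists keyed by sig-token and weight: 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0), slr (2.0); each list holds the ids of the strings that have that sig-token]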
Approximate string matching
 Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for
approximate member extraction using signature-based inverted lists.
CIKM 2009:315-324
 Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String
Search. ICDE 2009:604-615
 Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering
Algorithms for Approximate String Searches. ICDE 2008:257-266
 Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu,
Xiaoyong Du: Efficient Algorithm for Computing Link-Based
Similarity in Real World Networks. ICDM 2009:734-739
 ……
Thank you
Q&A