# DATA MINING Notes

```VLDB Database School (China) 2010
August 3-7, 2010, Shenyang
Lecture Notes
Part 1
Mining and Searching Complex
Structures
Anthony K.H. Tung(邓锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Mining and Searching Complex Structures
Contents
Chapter 1: Introduction
Chapter 2: High Dimensional Data
Chapter 3: Similarity Search on Sequences
Chapter 4: Similarity Search on Trees
Chapter 5: Graph Similarity Search
Chapter 6: Massive Graph Mining
Mining and Searching Complex Structures
Chapter 1 Introduction
What is data mining?
Really nothing different from what scientists have long been doing:
• The real world generates data
• Scientists collect data and verify or construct a model of the real world
• A correct, useful model can earn a Nobel Prize
What’s new?
• Feed in data
• Systematically and efficiently test many statistical models
• Output the most likely model based on some statistical measure
Components of data mining
• Structure of model
–geneA=high and geneB=low ==> cancer
–geneA, geneB and geneC exhibit strong correlation
• Statistical score for the model
–Accuracy of rule 1 is 90%
• Similarity function: is there a sufficiently similar group of records that supports a certain model or hypothesis?
• Search method for the correct model parameters
–Given 200 genes, there could be 2^200 rules. Which rule gives the best prediction power?
• Database access method
–Given 1 million records, how to quickly find the relevant records to compute the accuracy of a rule?
The Apriori Algorithm
• Search over the lattice of itemsets, starting from the empty set {}
• Only reads are performed on the database
• Store candidates in memory to simulate the lattice search
• Two steps:
–generate candidates
–count and get actual frequent items
[Figure: itemset lattice over items a, b, c, e; the search starts at {}]
{}
a   b   c   e
a,b   a,c   a,e   b,c   b,e   c,e
a,b,c   a,b,e   a,c,e   b,c,e
a,b,c,e
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no assignment changes.
The K-Means Clustering Method
• Example
[Figure: k-means example, scatter plots on axes from 0 to 10]
Training Dataset (Decision Tree)
Day  Outlook   Temp  Humid   Wind    PlayTennis
D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No
Selecting the Next Attribute
S=[9+,5-], E=0.940
Humidity
–High: [3+,4-], E=0.985
–Normal: [6+,1-], E=0.592
Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Wind
–Weak: [6+,2-], E=0.811
–Strong: [3+,3-], E=1.0
Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Outlook
–Sunny: [2+,3-], E=0.971
–Overcast: [4+,0-], E=0.0
–Rain: [3+,2-], E=0.971
Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
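A worked check of the entropy values E used on this slide, assuming the usual two-class definition E = -p+ log2(p+) - p- log2(p-), which the slides rely on but do not spell out:
E([9+,5-]) = -(9/14)log2(9/14) - (5/14)log2(5/14) = 0.4098 + 0.5305 = 0.9403, i.e. 0.940
E([3+,4-]) = -(3/7)log2(3/7) - (4/7)log2(4/7) = 0.5239 + 0.4613 = 0.9852, i.e. 0.985
E([4+,0-]) = 0, since a pure subset has zero entropy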
ID3 Algorithm
[D1,D2,…,D14], [9+,5-]
Outlook
–Sunny: Ssunny=[D1,D2,D8,D9,D11], [2+,3-], ?
–Overcast: [D3,D7,D12,D13], [4+,0-], Yes
–Rain: [D4,D5,D6,D10,D14], [3+,2-], ?
Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
Decision Tree for PlayTennis
Outlook
–Sunny: Humidity
  –High: No
  –Normal: Yes
–Overcast: Yes
–Rain: Wind
  –Strong: No
  –Weak: Yes
Can we fit what we learn into the framework?
• Task
–Apriori: rule pattern discovery
–K-means: clustering
–ID3: classification
• Structure of the model or pattern
–Apriori: association rules
–K-means: clusters
–ID3: decision tree
• Search space
–Apriori: lattice of all possible combinations of items, size = 2^m
–K-means: choice of any k points as centers, size = infinity
–ID3: all possible combinations of decision trees, size = potentially infinity
• Score function
–Apriori: support, confidence
–K-means: square error
–ID3: accuracy, information gain
• Search/optimization method
–Apriori: pruning
–ID3: greedy
• Data management technique
–Apriori: TBD
–K-means: TBD
–ID3: TBD
Components of data mining (II)
[Figure: layered components]
• Models Enumeration Algorithm
• Statistical Score Function
• Similarity/Search Function
• Database Access Method
• Database
Background knowledge
• We assume you have some basic knowledge of data mining; the following slides will be very useful for this purpose
• Association Rule Mining
http://www.comp.nus.edu.sg/~atung/Renmin56.pdf
• Classification and Regression
http://www.comp.nus.edu.sg/~atung/Renmin67.pdf
• Clustering
http://www.comp.nus.edu.sg/~atung/Renmin78.pdf
IT Trend
• Processors are cheap and will become cheaper (multi-core processors, graphics cards)
• Storage will be cheap but might not be fast
• Bandwidth will keep growing
What can we do with this?
• Play more realistic games!
–Not exactly a joke, since any technology that speeds up games can speed up scientific simulation
• Smarter (more intensive) computation
• Can store more personal semantics/ontologies
• People can collaborate more over the Internet (Flickr, Wikipedia) to make things more intelligent
• The AI dream now has the support of much better hardware
Essentially, data mining can be made much simpler for the man on the street
Data mining should be human-centered, not machine-centered
What is complex data?
• What is “simple” data?
–A regular table with a small number of attributes (of the same type) and no missing values
• What are complex data?
–High dimensional data: lots of attributes with different data types and with missing values
Test1  Gene1  Progress
Pos    2.0    Fever
Neg    -0.3   Unconscious
N.A    5.7    ……
–Sequences / time series
–Trees
–Graphs
Why complex data?
• They come naturally in many applications and bring research nearer to the real world
• Lots of challenges, which means more fun!
• Some fundamental challenges:
–How do you compare complex objects effectively and efficiently?
–How do you find special subsets of the data that are interesting?
–What new types of models and score functions must you use?
–How do you handle noise and errors?
[Figure: the example table (Test1, Gene1, Progress) and two trees, T1 and T2, with nodes labeled a to e]
Personalized Semantic for Personal Data Management
• Everyone will own terabytes of data soon
• Improve the query/search interface by mining and extracting personalized semantics, such as entities and their relationships, by comparing personal data against high quality tagged databases
[Figure: three layers]
• Queries: query by documents, by audio/music, by video, by photographs/images
• Semantic layer: singers, authors, actors/actresses, songs, papers, places, movies; linked to high quality data sources such as Wikipedia
• Personal data: documents, audio, music, video, photographs/images, webpages/blogs/bookmarks
Integrated Approach to Mining Software Engineering Data
• Software engineering data (code base, change history, bug reports, runtime traces) are integrated into a data warehouse to support decision making and mining
• Example: Which code module should I modify to create a new function? Which modules need maintenance?
[Figure: three layers]
• Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
• Data warehouse, mined by: classification, association/patterns, clustering, …
• Software engineering data: code bases, change history, program states, structural entities, bug reports/nl, …
WikiScience
• Collaborative platform for scientists to build scientific models/hypotheses and share data and applications
[Figure: three models]
• Model A: “This is my model of the solar system based on my supporting dataset” (supporting dataset tagged to Model A)
• Model B: “Based on some articles, I make some changes to Model A to create Model B” (supporting articles tagged to Model B)
• Model C: centralized, hybrid model constructed by the system
Hey, why not Cloud Computing, Map/Reduce?
• These are platforms for scaling up services to a large number of users on a large amount of data
• But what exactly do you want to scale up?
• Services that provide useful and semantically correct information to the users
• We have too many scalable data mining algorithms that find nothing or too many things
• Let’s focus on finding useful things first (assuming we have lots of processing power) and then try to scale it up
Schedule of the Course
Date/Time   Content
Lesson 1    Introduction
Lesson 2    Mining and Search High Dimensional Data I
Lesson 3    Mining and Search High Dimensional Data II
Lesson 4    Mining and Search High Dimensional Data III
Lesson 5    Similarity Search for Sequences and Trees I
Lesson 6    Similarity Search for Sequences and Trees III
Lesson 7    Similarity Search for Graph I
Lesson 8    Similarity Search for Graph II
Lesson 9    Similarity Search for Graph III
Lesson 10   Mining Massive Graph I
Lesson 11   Mining Massive Graph II
Lesson 12   Mining Massive Graph III
Focus of the course
• Techniques that can handle high dimensional, complex structures
–Providing semantics to similarity search
–Shotgun and Assembly: Column/Feature Wise Processing using Inverted Index
–Row-wise Enumeration
–Using local properties to infer global properties
• Throughout the course, please try to think of how these techniques are applicable across different types of complex structures
Database Queries
• To start off, we will consider something very basic called ranking queries, since we need ranking in any similarity search (usually from most similar to most dissimilar)
• In a relational database, SQL returns all results in one go
• How many tuples can fit on one screen?
• How many tuples can you remember?
• Options:
–Summarize the results
–Display representative tuples
• How to select representative tuples?
Retrieve Relevant Information
• Search videos related to Shanghai Expo
• Too many results: as long as you click “next”, there are 20 more new results
• Are we interested in all results?
–No, only the most relevant ones
• Search engines have to rank the results, and ranking is what they make money from
Question: How to Select a Small Result Set
• Selecting the most representative or most interesting results is not trivial
• Find an apartment with rent cheaper than 1000, the cheaper the better
–The result tuples can be sorted in ascending order of rental price; those in front are more favorable
• Find an apartment with rent cheaper than 1000 near NEU, the cheaper the better, the nearer the better
–An apartment with lower rent may not be near; a nearer one may not be cheap
–Order by price? Order by distance?
Top-k Queries
• Define a scoring function, which maps a tuple to a real number, as a score
• The higher the score, the more favorable the tuple
• Define an integer k
• Answer: the k objects with the highest scores
• Different scoring functions may give different top-k results
              Price   Distance to NEU
Apartment A   $800    500 meters
Apartment B   $1200   200 meters
• Given k = 1, if the score function is defined as the sum of price and distance, the first tuple is better; if it is defined as the product, the second tuple is better (both attributes are costs here, so the smaller combined value wins)
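Worked arithmetic for the apartment example, as a sketch that treats price and distance as plain numbers (units ignored) and takes the smaller combined value as better:
Sum: A = 800 + 500 = 1300, B = 1200 + 200 = 1400, so A is preferred
Product: A = 800 * 500 = 400000, B = 1200 * 200 = 240000, so B is preferred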
Brute Force Top-k
• Compute scores for each result tuple
• Sort the tuples in descending order of score
• Select the first k tuples
• What if the number of tuples is unlimited? Search engines can give an unlimited number of results
• Even if the number of tuples is limited, it is too slow to compute the score of every tuple
• We have to do it efficiently
Outline
• Two well-known top-k algorithms
–Fagin's Algorithm (FA)
–The Threshold Algorithm (TA)
• Take random access into consideration
–No Random Access Algorithm (NRA)
–The Combined Algorithm (CA)
Monotonicity
• A score function f is monotone if f(x1,x2,...,xm) ≤ f(y1,y2,...,ym) whenever xi ≤ yi for every i
• Select top-3 students with highest total score in mathematics, physics and computer science:
select name, math+phys+comp as score
from student
order by score desc limit 3
• sum(x.math,x.phys,x.comp) ≤ sum(y.math,y.phys,y.comp) if x.math ≤ y.math and x.phys ≤ y.phys and x.comp ≤ y.comp
Sorted Lists
• We shall think of a database as consisting of m sorted lists L1, L2, …, Lm
Lmath       Lphys       Lcomp
Ann   98    Hugh  97    Kurt  96
Ben   96    Ryan  94    Ann   95
Kurt  93    Ann   92    Jane  95
Hugh  91    Kurt  91    Ben   93
Carl  90    Jane  89    Hugh  92
...         ...         ...
Fagin's Algorithm (I)
• Do sequential access until there are at least k matches, i.e. k objects that have been seen on all lists
• Sequential accesses are stopped when 3 students have been seen on all lists, i.e. Ann, Hugh and Kurt
Fagin's Algorithm (II)
• For each object that has been seen, do random accesses on the other lists to compute its score
• Random accesses need to be done for Ben, Carl, Jane and Ryan
Fagin's Algorithm (III)
• Select the k objects with the highest scores as the top-k result
Why is FA correct? (I)
• There are at least k objects seen on all attributes when sequential access is stopped
• By monotonicity, the objects that have not been seen do not have a higher score than those k objects
Why is FA correct? (II)
• For every object that has been seen, either all of its attributes have been seen, or random accesses are performed to obtain the missing ones
• The k objects with the highest scores are therefore the top-k result
The Threshold Algorithm (I)
• Do sequential access on all lists. If an object is seen, do random accesses on the other lists to compute its score
• Random accesses on Ann, Hugh and Kurt first, then on Ben and Ryan
The Threshold Algorithm (II)
• Remember the k objects with the highest scores, together with their scores
Score (Ann) = 285
Score (Hugh) = 280
Score (Kurt) = 280
The Threshold Algorithm (III)
• Let the threshold value τ be the value of the score function on the last seen values of all sorted lists
• Halt as soon as at least k objects have a score of at least τ
τ(1) = 291
τ(2) = 285
τ(3) = 280
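How these τ values arise, as a sketch assuming the sum score function and k = 3 of the running student example:
τ(1) = 98 + 97 + 96 = 291 (last seen: Ann in Lmath, Hugh in Lphys, Kurt in Lcomp)
τ(2) = 96 + 94 + 95 = 285
τ(3) = 93 + 92 + 95 = 280
After the third round the three best scores seen are 285, 280 and 280, all at least τ(3) = 280, so TA halts at depth 3. On the same lists FA reads down to depth 5 before Ann, Hugh and Kurt have each been seen on every list.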
Why is TA correct?
• By monotonicity, the unseen objects cannot have a score higher than τ
• Every object that has been seen has had its full score computed by random accesses, so the k objects with the highest scores are the top-k result
Comparing TA with FA
• Number of sequential accesses
–At the time FA stops sequential accesses, τ is guaranteed to be no higher than the scores of the k objects seen on all sorted lists, so TA stops no later than FA
• Number of random accesses
–TA requires m-1 random accesses for each object
–But FA is expected to random access more objects
• Size of buffers used
–Buffer used by FA can be unbounded
–TA only needs to remember k objects with their k scores, and the threshold value τ
Random Access
• Random accesses may be impossible
–Text retrieval: sorted lists are results of search engines
• Random accesses are expensive
–Sequential accesses on disk are orders of magnitude faster than random accesses
• We need to consider not using random accesses, or using as few of them as possible
No Random Access
• Without random access, all we know are the upper bounds
• Carl’s scores on physics and computer science are not higher than 89 and 92 respectively
Lower and Upper Bounds
• If an object has not been seen on one attribute
–Lower bound is 0
–Upper bound is the last seen value
• The lower bound of Carl’s score on physics is 0
• The upper bound of Carl’s score on physics is 89
Worst and Best Scores (I)
• W (R): the worst possible score of tuple R
• B (R): the best possible score of tuple R
W (Carl) = 90
B (Carl) = 90 + 89 + 92 = 271
Worst and Best Scores (II)
• W (R) ≤ Score of R ≤ B (R)
• W (R) and B (R) get updated as the values of R are sequentially accessed
After the first round of sequential accesses:
     Ann   Hugh  Kurt
W    98    97    96
B    291   291   291
After the second round:
     Ann       Hugh      Kurt      Ben   Ryan
W    98→193    97        96        96    94
B    291→287   291→288   291→286   285   285
After the third round:
     Ann       Hugh      Kurt      Ben       Ryan      Jane
W    193→285   97        96→189    96        94        95
B    287→285   288→285   286→281   285→283   285→282   280
No Random Access Algorithm (I)
• Maintain the last-seen values x1,x2,…,xm
• For every seen object, maintain its worst possible score, its known attributes and their values
–xmath = 96; xphys = 94; xcomp = 95
–Ann:193:{<Math:98>;<Comp:95>}
No Random Access Algorithm (II)
• Why not maintain the best possible score for each object?
     Ann       Hugh      Kurt      Ben       Ryan      Jane
W    193→285   97        96→189    96        94        95
B    287→285   288→285   286→281   285→283   285→282   280
• Too frequently updated! B (R) depends on the last-seen values, so it changes for many objects at every step
No Random Access Algorithm (III)
• Let M be the kth largest W value
• An object R is viable if B (R) ≥ M
After the fourth round of sequential accesses:
     Ann   Hugh      Kurt      Ben       Ryan      Jane
W    285   97→188    189→280   96→189    94        95
B    285   285→281   281→280   283→280   282→278   280→277
M = 189
After the fifth round:
     Ann   Hugh      Kurt   Ben       Ryan      Jane
W    285   188→280   280    189       94        95→184
B    285   281→280   280    280→278   278→276   277→274
M = 280
No Random Access Algorithm (IV)
• Let set T contain the objects with W (R) ≥ M
• Halt when
–There are at least k objects seen on all sorted lists
–No viable objects are left outside set T
     Ann   Hugh  Kurt  Ben   Ryan  Jane
W    285   280   280   189   94    184
B    285   280   280   278   276   274
M = 280
T = {Ann, Hugh, Kurt}
Why is NRA correct?
• W (R) ≤ Score of R ≤ B (R) always holds
• If an object R is not viable, then Score of R ≤ B (R) < M, so there are at least k objects with scores not lower than that of R
• Therefore, if there is no viable object outside T and T contains at least k objects, T is the top-k result
Comparing NRA with TA
• Number of sequential accesses
–The number of sequential accesses of NRA is at least the last position of the top-k results on all attributes
• Number of random accesses
–For NRA it is obviously 0
• Size of buffers used
–TA remembers k objects with k scores, and the threshold value τ
–NRA remembers all viable objects with their scores on all seen attributes, and the last-seen value on all attributes
How deep can NRA go?
L1          L2          L3
Ann   98    Hugh  97    Kurt  96
Hugh  97    Kurt  96    Ann   95
Ben   60    Ryan  60    Jane  60
Ryan  60    Ben   60    Ben   60
Carl  60    Jane  60    Carl  60
...         ...         ...
Jane  60    Carl  60    Ryan  60
Kurt  0     Ann   0     Hugh  0
• The set T can be identified quickly, but the scores of its members only become certain at the end of the lists
• If we allow a relatively small number of random accesses, scanning the entire lists can be avoided
The Combined Algorithm (I)
• CA combines TA and NRA
• cR: the cost of a random access
• cS: the cost of a sequential access
• h = ⌊cR / cS⌋
• Run NRA, but every h steps do random accesses, as in TA
• h = ∞ → never do random accesses; CA is then NRA
The Combined Algorithm (II)
• On the lists of “How deep can NRA go?”, random accesses for Ann, Hugh and Kurt quickly find out their scores
The Combined Algorithm (III)
• In CA, by doing random accesses, we wish to either
–Confirm an object is a top-k result, or
–Prune a viable object
• As the number of random accesses in CA is limited, various heuristics can be used to optimize CA in terms of total cost
Reference
• Ronald Fagin, Amnon Lotem, Moni Naor: Optimal
aggregation algorithms for middleware. J. Comput. Syst.
Sci. 66(4): 614-656 (2003)
Mining and Searching Complex Structures
Chapter 2 High Dimensional Data
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Sources of High Dimensional Data
• Microarray gene expression
• Text documents
• Images
• Features of Sequences, Trees and Graphs
• Audio, Video, Human Motion Database (spatiotemporal as well!)
Challenges of High Dimensional Data
• Indistinguishable
–The distance between the two nearest points and the two furthest points could be almost the same
• Sparsity
–As a result of the above, the data distribution is very sparse, giving no obvious indication of where the interesting knowledge is
• Large number of combinations
–Efficiency: how to test the large number of combinations?
–Effectiveness: how do we understand and interpret so many combinations?
• Objects represented by multidimensional vectors
–Example: (2596, 51, 3, 221, 232, 148, …) with attributes Elevation, Aspect, Slope, …
• The traditional approach to similarity search: kNN query
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
ID   d1    d2    d3    d4    d5    d6    d7    d8    d9    d10   Dist
P1   1.1   1     1.2   1.6   1.1   1.6   1.2   1.2   1     1     0.93
P2   1.4   1.4   1.4   1.5   1.4   1     1.2   1.2   1     1     0.98
P3   1     1     1     1     1     1     2     1     2     2     1.73
P4   20    20    21    20    22    20    20    19    20    20    57.7
P5   19    21    20    20    20    21    18    20    22    20    60.5
P6   21    21    18    19    20    19    21    20    20    20    59.8
• Deficiencies
–Distance is affected by a few dimensions with high dissimilarity
–Partial similarities cannot be discovered
• The traditional approach to similarity search: kNN query
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
–When a single dimension of each of P1, P2 and P3 is changed to 100, their distances to Q become 99.0:
ID   Dist
P1   0.93 → 99.0
P2   0.98 → 99.0
P3   1.73 → 99.0
P4   57.7
P5   60.5
P6   59.8
Thoughts
• Aggregating too many dimensional differences into a single value results in too much information loss. Can we try to reduce that loss?
• While high dimensional data typically gives us problems when it comes to similarity search, can we turn what is against us into an advantage?
• Our approach: since we have so many dimensions, we can compute more complex statistics over these dimensions to overcome some of the “noise” introduced by scaling of dimensions, outliers, etc.
The N-Match Query : Warm-Up
• Description
–Matches between two objects in n dimensions (n ≤ d)
–The n dimensions are chosen dynamically to make the two objects match best
• How to define a “match”
–Exact match
–Match with tolerance δ
• The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), n = 6
(P1 to P6 as in the kNN example, with one dimension of each of P1, P2 and P3 set to 100)
ID   Dist
P1   0.2
P2   0.4
P3   0
P4   19
P5   19
P6   19
The N-Match Query : The Definition
• The n-match difference
Given two d-dimensional points P(p1, p2, …, pd) and Q(q1, q2, …, qd), let δi = |pi - qi|, i=1,…,d. Sort the array {δ1, …, δd} in increasing order and let the sorted array be {δ1’, …, δd’}. Then δn’ is the n-match difference between P and Q.
• The n-match query
Given a d-dimensional database DB, a query point Q and an integer n (n≤d), find the point P ∈ DB that has the smallest n-match difference to Q. P is called the n-match of Q.
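A worked instance of the definition, using point P4 and the query Q of the similarity search example in this chapter (only the arithmetic is added here):
P4 = (20, 20, 21, 20, 22, 20, 20, 19, 20, 20), Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
δ = (19, 19, 20, 19, 21, 19, 19, 18, 19, 19)
sorted δ’ = (18, 19, 19, 19, 19, 19, 19, 19, 20, 21)
1-match difference = δ1’ = 18, 6-match difference = δ6’ = 19, 10-match difference = δ10’ = 21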
[Figure: points A, B, C, D, E and the query Q in the x-y plane, both axes from 0 to 10; 1-match=A, 2-match=B]
• The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), n = 8
ID   Dist
P1   0.6
P2   0.4
P3   1
The N-Match Query : Extensions
• The k-n-match query
Given a d-dimensional database DB, a query point Q, an integer k, and an integer n, find a set S which consists of k points from DB so that for any point P1 ∈ S and any point P2 ∈ DB-S, P1’s n-match difference is smaller than P2’s n-match difference. S is called the k-n-match of Q.
• The frequent k-n-match query
Given a d-dimensional database DB, a query point Q, an integer k, and an integer range [n0, n1] within [1,d], let S0, …, Si be the answer sets of k-n0-match, …, k-n1-match, respectively. Find a set T of k points, so that for any point P1 ∈ T and any point P2 ∈ DB-T, P1’s number of appearances in S0, …, Si is larger than or equal to P2’s number of appearances in S0, …, Si.
[Figure: points A, B, C, D, E and query Q in the x-y plane; 2-1-match={A,D}, 2-2-match={A,B}]
• The similarity search example (n=6)
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
ID   d1    d2    d3    d4    d5    d6    d7    d8    d9    d10   Dist
P1   1.1   1     1.2   1.6   1.1   1.6   1.2   1.2   1     1     0.2
P2   1.4   1.4   1.4   1.5   1.4   1     1.2   1.2   1     1     0.4, 0.98
P3   1     1     1     1     1     1     2     1     2     2     1.73, 0
P4   20    20    21    20    22    20    20    19    20    20    19
P5   19    21    20    20    20    21    18    20    22    20    19
P6   21    21    18    19    20    19    21    20    20    20    19
(On the slide, one attribute value in each of P1, P2 and P3 is overlaid with the outlier value 100; the Dist column lists every value overlaid for that row.)
Cost Model
• The multiple system information retrieval model
–Objects are stored in different systems and scored by each system
–Each system can sort the objects according to their scores
–A query retrieves the scores of objects from different systems and then combines them using some aggregation function
Q : color=“red” & shape=“round” & texture=“cloud”
System 1: Color          System 2: Shape          System 3: Texture
Object ID   Score        Object ID   Score        Object ID   Score
1           0.4          1           1.0          1           1.0
2           2.8          5           1.5          2           2.0
5           3.5          2           5.5          3           5.0
3           6.5          3           7.8          5           8.0
4           9.0          4           9.0          4           9.0
• The cost
–Retrieval of scores – proportional to the number of scores retrieved
• The goal
–To minimize the scores retrieved
• The AD algorithm for the k-n-match query
–Locate the query’s attribute values in every dimension
–Retrieve the objects’ attribute values starting from the query’s attribute values, in both directions
–The objects’ attributes are retrieved in Ascending order of their Differences to the query’s attributes. An n-match is found when it appears n times.
2-2-match of Q : ( 3.0 , 7.0 , 4.0 )
d1                       d2                       d3
Object ID   Attr         Object ID   Attr         Object ID   Attr
1           0.4          1           1.0          1           1.0
2           2.8          5           1.5          2           2.0
5           3.5          2           5.5          3           5.0
3           6.5          3           7.8          5           8.0
4           9.0          4           9.0          4           9.0
Auxiliary structures
–Next attribute to retrieve: g[2d]
–Number of appearances: appear[c]
[Animation residue: successive contents of g[ ] and appear[ ] for this query; answer set { 3 , 2 }]
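Trace of the AD algorithm for the 2-2-match of Q = ( 3.0 , 7.0 , 4.0 ), worked out here as an illustration from the attribute values on this slide (Object ID: d1, d2, d3):
  1: 0.4, 1.0, 1.0    2: 2.8, 5.5, 2.0    3: 6.5, 7.8, 5.0    4: 9.0, 9.0, 9.0    5: 3.5, 1.5, 8.0
Initial g[ ] as (Object ID, difference), one entry per direction of each dimension:
  d1: down (2, 0.2), up (5, 0.5)
  d2: down (2, 1.5), up (3, 0.8)
  d3: down (2, 2.0), up (3, 1.0)
Steps (always take the smallest difference in g[ ], then refill that slot with the next attribute in the same direction):
  1. (2, 0.2) from d1 down: appear[2] = 1; slot becomes (1, 2.6)
  2. (5, 0.5) from d1 up:   appear[5] = 1; slot becomes (3, 3.5)
  3. (3, 0.8) from d2 up:   appear[3] = 1; slot becomes (4, 2.0)
  4. (3, 1.0) from d3 up:   appear[3] = 2, so object 3 is reported; slot becomes (5, 4.0)
  5. (2, 1.5) from d2 down: appear[2] = 2, so object 2 is reported
Result: 2-2-match = { 3 , 2 } after retrieving 5 of the 15 attribute values.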
• The AD algorithm for the frequent k-n-match query
–The frequent k-n-match query
• Given an integer range [n0, n1], find k-n0-match, k-(n0+1)-match, ... , k-n1-match of the query, S0, S1, ... , Si.
• Find k objects that appear most frequently in S0, S1, ... , Si.
–Retrieve the same number of attributes as processing a k-n1-match query.
• Disk based solutions for the (frequent) k-n-match query
• Sort each dimension and store them sequentially on the disk
• When reaching the end of a disk page, read the next page from disk
–Existing indexing techniques
• Tree-like structures: R-trees, k-d-trees
• Mapping based indexing: space-filling curves, iDistance
• Sequential scan
• Compression based approach (VA-file)
Experiments : Effectiveness
• Searching by k-n-match
–COIL-100 database
–54 features extracted, such as color histograms, area moments
k-n-match query, k=4
n     Images returned
5     36, 42, 78, 94
10    27, 35, 42, 78
15    3, 38, 42, 78
20    27, 38, 42, 78
25    35, 40, 42, 94
30    10, 35, 42, 94
35    35, 42, 94, 96
40    35, 42, 94, 96
45    35, 42, 94, 96
50    35, 42, 94, 96
kNN query
k     Images returned
10    13, 35, 36, 40, 42, 64, 85, 88, 94, 96
• Searching by frequent k-n-match
–UCI Machine learning repository
–Competitors: IGrid, Human-Computer Interactive NN search (HCINN)
Data sets (d)       IGrid    HCINN   Freq. k-n-match
Ionosphere (34)     80.1%    86%     87.5%
Segmentation (19)   79.9%    83%     87.3%
Wdbc (30)           87.1%    N.A.    92.5%
Glass (9)           58.6%    N.A.    67.8%
Iris (4)            88.9%    N.A.    89.6%
Experiments : Efficiency
• Disk based algorithms for the frequent k-n-match query
–Texture dataset (68,040 records); uniform dataset (100,000 records)
–Competitors:
• VA-file
• Sequential scan
Experiments : Efficiency (continued)
•
Comparison with other similarity search techniques
–Texture dataset ; synthetic dataset
–Competitors:
• Frequent k-n-match query using the AD algorithm
• IGrid
• scan
Future Work(I)
• We now have a natural way to handle similarity search for data with both categorical and numerical attributes. Investigating k-n-match performance on such mixed-type data is currently under way
• Likewise, applying k-n-match on data with missing or uncertain attributes will be interesting
• Query={1,1,1,1,1,1,1,M,No,R}
ID   d1    d2    d3    d4    d5    d6    d7    d8   d9    d10
P1   1.1   1     1.2   1.6   1.1   1.6   1.2   M    Yes   R
P2   1.4   1.4   1.4   1.5   1.4   1     1.2   F    No    B
P3   1     1     1     1     1     1     2     M    No    B
P4   20    20    21    20    22    20    20    M    Yes   G
P5   19    21    20    20    20    21    18    F    Yes   R
P6   21    21    18    19    20    19    21    F    Yes   Y
[Slide repeated; in this version several attribute values of P1 to P6 are left blank in the table, illustrating missing attributes]
Future Work(II)
• In general, three things affect the result of a similarity search: noise, scaling and axes orientation. k-n-match reduces the effect of noise. The ultimate aim is to have a similarity function that is robust to noise, scaling and axes orientation
• Eventually will look at creating mining algorithms using k-n-match
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
[Figure: a query is posed against large data sets and results are returned]
Ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries
Concern: compress as much as possible.
Conventional Compression Method
• Try to find the optimal encoding of arbitrary strings for
the input data:
–Huffman Coding
–Lempel-Ziv Coding (gzip)
• View the whole table as a large byte string
• Statistical or dictionary based
• Operate at the byte level
Why not just “syntactic”?
• Do not exploit the complex dependency patterns in the table
• Individual retrieval of tuple is difficult
• Do not utilize lossy compression
Semantic compression methods
• Derive a descriptive model M
• Identify the data values which can be derived from M (within
some error tolerance), which are essential for deriving, and
which are the outliers
• Derived values need not be stored; only the outliers need to be
• More Complex Analysis
–Example: detect correlation among columns
• Fast Retrieval
–Tuple-wise access
• Query Enhancement
–Possible to answer queries directly from the discovered semantics
–Compress in a way that enhances the answering of some complex queries, e.g. “Go Green: Recycle and Reuse Frequent Patterns”, C. Gao, B. C. Ooi, K. L. Tan and A. K. H. Tung. ICDE’2004.
Choose a combination of compression methods based on semantic and syntactic information
Fascicles
• Key observation
–Often, numerous subsets of records in T have similar values for
many attributes
Protocol   Duration   Bytes   Packets
http       12         20K     3
http       16         24K     5
http       15         20K     8
http       19         40K     11
http       26         58K     18
ftp        27         100K    24
ftp        32         300K    35
ftp        18         80K     15
• Compress data by storing representative values (e.g., “centroid”) only once for each attribute cluster
• Lossy compression: information loss is controlled by the notion of “similar values” for attributes (user-defined)
ItCompress: Compression Format
Original Table
age   salary   credit   sex
20    30k      poor     M
25    76k      good     F
30    90k      good     F
40    100k     poor     M
50    110k     good     F
60    50k      good     M
70    35k      poor     F
75    15k      poor     M
Representative Rows (Patterns)
RRid   age   salary   credit   sex
1      30    90k      good     F
2      70    35k      poor     M
Error Tolerance:
age   salary   credit   sex
5     25k      0        0
Compressed Table
RRid   bitmap   Outlying value
2      0111     20
1      1111
1      1111
1      0100     40, poor, M
1      0111     50
1      0010     60, 50k, M
2      1110     F
2      1111
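Reading one compressed row, worked through here as an illustration of the format. The assumption, which is what the listed rows suggest, is that a bit of 1 means the attribute is taken from the representative row and a bit of 0 means it is stored as an outlying value:
  compressed row:       RRid = 1, bitmap = 0100, outlying values = 40, poor, M
  representative row 1: age 30, salary 90k, credit good, sex F
  age:    bit 0 -> outlying value 40
  salary: bit 1 -> 90k from the representative row (original 100k, within the 25k tolerance)
  credit: bit 0 -> outlying value poor
  sex:    bit 0 -> outlying value M
  decoded row: 40, 90k, poor, M (original row: 40, 100k, poor, M)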
Some definitions
• Error tolerance
–Numeric attributes
• The upper bound on how much the compressed value x’ may differ from the actual value x
• x ∈ [ x’-ei, x’+ei ]
–Categorical attributes
• The upper bound on the probability that the compressed value differs from the actual value
• Given an actual value x and its error tolerance ei, the compressed value x’ should satisfy: Prob( x=x’ ) ≥ 1 - ei
Some definitions
• Coverage
–Let R be a row in the table T, and Pi be a pattern
–The coverage of Pi on R:
cov( Pi , R ) = number of attributes Xi in which R[ Xi ] is matched by Pi[ Xi ]
• Total coverage
–Let P be a set of patterns P1,…,Pk; and the table T contains n rows R1,…,Rn
–totalcov( P, T ) = Σ i=1..n cov( Pmax( Ri ), Ri ), where Pmax( Ri ) is the pattern in P with the largest coverage on Ri
ItCompress: basic algorithm
• First randomly choose k rows as initial patterns
• Phase 1: scan the table T
–For each row R, compute the coverage of each pattern on it, then try to find Pmax(R)
–Allocate R to the pattern that covers it most
• Phase 2: after each iteration, re-compute all patterns’ attributes, always using the most frequent values
• Iterate until the total coverage does not increase
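The two phases can be summarised in pseudocode. This is a sketch written for these notes; the name assign[ ] is introduced here for illustration only:
  choose k rows of T at random as patterns P1..Pk
  repeat
      Phase 1: for each row R in T
                   assign[R] = the pattern Pi that maximises cov( Pi , R )
      Phase 2: for each pattern Pi and each attribute X
                   Pi[X] = the value (within the error tolerance) matched by the most rows R with assign[R] = Pi
  until totalcov( P, T ) does not increase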
Example: the 1st iteration begins
Patterns
RRid   age   salary   credit   sex
1      20    30k      poor     M
2      25    76k      good     F
Table
age   salary   credit   sex
20    30k      poor     M
25    76k      good     F
30    90k      good     F
40    100k     poor     M
50    110k     good     F
60    50k      good     M
70    35k      poor     F
75    15k      poor     M
Error Tolerance:
age   salary   credit   sex
5     25k      0        0
Example: Phase 1
Patterns
RRid   age   salary   credit   sex
1      20    30k      poor     M
2      25    76k      good     F
Rows allocated to pattern 1
age   salary   credit   sex
20    30k      poor     M
40    100k     poor     M
60    50k      good     M
70    35k      poor     F
75    15k      poor     M
Rows allocated to pattern 2
age   salary   credit   sex
25    76k      good     F
30    90k      good     F
50    110k     good     F
Error Tolerance:
age   salary   credit   sex
5     25k      0        0
Example: Phase 2
Patterns (re-computed)
RRid   age            salary           credit   sex
1      70 (was 20)    30k              poor     M
2      25             90k (was 76k)    good     F
Rows allocated to pattern 1
age   salary   credit   sex
20    30k      poor     M
40    100k     poor     M
60    50k      good     M
70    35k      poor     F
75    15k      poor     M
Rows allocated to pattern 2
age   salary   credit   sex
25    76k      good     F
30    90k      good     F
50    110k     good     F
Error Tolerance:
age   salary   credit   sex
5     25k      0        0
Convergence(I)
• Phase 1:
–When we assign each row to the pattern that covers it most:
• For each row, the coverage increases or stays the same
⇒ So the total coverage also increases or stays the same
• Phase 2:
–When we re-compute the attribute values for the patterns:
• For each pattern, the coverage increases or stays the same
⇒ So the total coverage also increases or stays the same
Convergence(II)
• In both Phase 1 and Phase 2, the total coverage either increases or stays the same, and it has an obvious upper bound (covering the whole table)
⇒ The algorithm will converge eventually
Complexity
• Phase 1:
–In l iterations, we need to go through the n rows in the table and match each row against the k patterns (2m comparisons)
⇒ The running time complexity is O(kmnl), where m is the number of attributes
• Phase 2:
–Computing each new pattern Pi requires going through all the domain values/intervals of each attribute
⇒ Assuming the total number of domain values/intervals is d, the running time complexity is O(kdl)
⇒ The total time complexity is O(kmnl+kdl)
• Simplicity and Directness
–Fascicles and SPARTAN use a two-phase process
• Find rules/patterns
• Compress database using discovered rules/patterns
–ItCompress optimizes the compression directly without finding rules/patterns that may not be useful (a.k.a. microeconomic approach)
• Fewer constraints
–Does not need patterns to be matched completely or rules that apply globally
• Easily tuned parameters
Performance Comparison
• Algorithms
–ItCompress, ItCompress+gzip
–Fascicles, Fascicles+gzip
–SPARTAN+gzip
• Platform
–ItCompress, Fascicles: AMD Duron 700MHz, 256MB Memory
–SPARTAN: Four 700MHz Pentium CPUs, 1GB Memory
• Datasets
–Corel: 32 numeric attributes, 35000 rows, 10.5MB
–Census: 7 numeric, 7 categorical, 676000 rows, 28.6MB
–Forest-cover: 10 numeric, 44 categorical, 581000 rows, 75.2MB
Effectiveness (Corel)
Effectiveness (Census)
Effectiveness (Forest Cover)
Efficiency
Varying k
Varying Sample Ratio
Effect of Corruption
[Plots not recovered; the corruption plots cover attributes A1 to A12 and are annotated “20% Corruption?”]
Findings
• ItCompress is
–More efficient than SPARTAN
–More effective than Fascicles
–Insensitive to parameter setting
–Robust to noise
Future work
• Can we perform mining on the compressed datasets using
only the patterns and the bitmap ?
–Example: Building Bayesian Belief Network
• Is ItCompress a good “bootstrap” semantic compression algorithm ?
[Figure: ItCompress turns a database into a compressed database, which is then passed to other semantic compression algorithms]
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Metric vs. Non-Metric
• Euclidean distance dominates DB queries
• Similarity in human perception
• Metric distance is not enough!
Bregman Divergence
[Figure: convex function f(x) with the points (p,f(p)) and (q,f(q)) and a line h; the Bregman divergence Df(p,q) and the Euclidean distance between p and q are marked]
Bregman Divergence
• Mathematical Interpretation
–The distance between p and q is defined as the difference between f(p) and the first order Taylor expansion at q
Df(p,q) = f(p) - [ f(q) + ⟨∇f(q), p - q⟩ ]
(first term: f(x) at p; bracketed term: first order Taylor expansion at q)
Bregman Divergence
• General Properties
–Non-Negativity
• Df(p,q)≥0 for any p, q
–Identity of Indiscernibles
• Df(p,p)=0 for any p
–Symmetry and Triangle Inequality
• Do NOT hold any more
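Worked example for scalars, derived here from the definition Df(p,q) = f(p) - f(q) - f’(q)(p - q); log is the natural logarithm and the numbers are illustrative:
  f(x) = x^2:      Df(p,q) = p^2 - q^2 - 2q(p - q) = (p - q)^2, which is symmetric
  f(x) = x log x:  Df(p,q) = p log(p/q) - p + q
                   Df(1,2) = log(1/2) - 1 + 2 ≈ 0.307
                   Df(2,1) = 2 log 2 - 2 + 1 ≈ 0.386
Df(1,2) and Df(2,1) differ, so symmetry fails for this choice of f. For distributions p and q that each sum to 1, the - p + q terms cancel over the bins and the KL-divergence, the sum of p log(p/q), remains.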
Examples
Distance                 f(x)               Df(p,q)                          Usage
KL-Divergence            x log x            p log (p/q)                      distribution, color histogram
Itakura-Saito Distance   -log x             p/q - log (p/q) - 1              signal, speech
Squared Euclidean        x^2                (p-q)^2                          Euclidean space
Von Neumann Entropy      tr(X log X – X)    tr(X log X – X log Y – X + Y)    symmetric matrix
Why in DB system?
• Database application
–Retrieval of similar images, speech signals, or time series
–Optimization on matrices in machine learning
–Efficiency is important!
• Query Types
–Nearest Neighbor Query
–Range Query
Euclidean Space
• How to answer the queries
–R-Tree
–VA File
Our goal
• Re-use the infrastructure of existing DB system to support
Bregman divergence
–Storage management
–Indexing structures
–Query processing algorithms
Basic Solution
• Extended Space
–Convex function f(x) = x^2
point   D1    D2        point   D1    D2    D3
p       0     1         p+      0     1     1
q       0.5   0.5       q+      0.5   0.5   0.5
r       1     0.8       r+      1     0.8   1.64
t       1.5   0.3       t+      1.5   0.3   3.15
Basic Solution
• After the extension
–Index extended points with R-Tree or VA File
–Re-use existing algorithms with lower and upper bounds on
the rectangles
How to improve?
• Reformulation of Bregman divergence
• Tighter bounds are derived
• No change on index construction or query processing
algorithm
A New Formulation
[Figure: lines h and h’, the query vector vq, the points p and q, and the quantities Df(p,q)+Δ and D*f(p,q)]
Math. Interpretation
• Reformulation of similarity search queries
–k-NN query: query q, data set P, divergence Df
• Find the point p, minimizing
–Range query: query q, threshold θ, data set P
• Return any point p that
Naïve Bounds
• Check the corners of the bounding rectangles
Tighter Bounds
• Take the curve f(x) into consideration
Query distribution
• Distortion of rectangles
–The difference between maximum and minimum distances
from inside the rectangle to the query
Can we improve it more?
• When Building R-Tree in Euclidean space
–Minimize the volume/edge length of MBRs
–Does it remain valid?
Query distribution
• Distortion of bounding rectangles
–Invariant in Euclidean space (triangle inequality)
–Query-dependent for Bregman Divergence
Utilize Query Distribution
• Summarize the query distribution with O(d) real numbers
• Estimate the expected distortion of any bounding rectangle in O(d) time
• Allows a better index to be constructed for both R-Tree and VA File
Experiments
• Data Sets
–KDD’99 data
• Network data, the proportion of packets in 72 different TCP/IP connection types
–DBLP data
• Use co-authorship graph to generate the probabilities of the authors related to 8 different areas
Experiment
• Data Sets
–Uniform Synthetic data
• Generate synthetic data with uniform distribution
–Clustered Synthetic data
• Generate synthetic data with Gaussian Mixture Model
Experiments
• Methods to compare
              Basic   Improved Bounds   Query Distribution
R-Tree        R       R-B               R-BQ
VA File       V       V-B               V-BQ
Linear Scan   LS
BB-Tree       BBT
Existing Solution
• BB-Tree (L. Cayton, ICML 2008)
–Memory-based indexing tree
–Construct with k-means clustering
–Hard to update
–Ineffective in high-dimensional space
Experiments
• Index Construction Time
• Varying dimensionality
• Varying data cardinality
[Plots not recovered]
Conclusion
• A general technique for similarity search with Bregman Divergence
• All techniques are based on existing infrastructure of commercial database
• Extensive experiments to compare performances with R-Tree and VA File with different optimizations
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
• Probabilistic data is ubiquitous
–To represent the data uncertainty (WSN, RFID, moving
object monitoring)
–To compress data (image processing)
• Histogram is a good way to represent the prob. data
–Easy to capture
–Is very useful in image representation
• Colors
• Textures
• Depth
Motivation
• Similarity search is important for managing prob. data
are similar with sensor A (range query)
–Can answer which k pictures are similar (top-k query)
• Similarity function for prob. data should be carefully
chosen
–Bin by bin methods
• L1 and L2 norms
• χ2 distance
–Cross-bin methods
• Earth Mover’s Distance (EMD)
Outline
• Motivation
• Introduction to Earth Mover’s Distance (EMD)
• Related works
• Indexing the probabilistic data based on EMD
• Experimental results
• Conclusion and future work
Introduction to Earth Mover’s Dist
• Bin by bin vs. cross bin
–Bin-by-bin: not good!
–Cross bin: good! Can handle distribution shift
Introduction to Earth Mover’s Dist
• What is EMD?
–Earth (soil)
–Mover (to carry)
–Distance (cost)
–Can be understood as the cost of moving earth
• See an example…
Moving Earth
[Figures: two histograms that differ (≠) are made equal (=) by moving earth between bins]
The Difference?
Difference = (amount moved) * (distance moved)
Linear programming
[Figures: histogram P with m bins (clusters) and histogram Q with n bins (clusters); the cost sums (distance moved) * (amount moved) over all movements]
Constraints
1. Move “earth” only from P to Q
2. Cannot send more “earth” than there is
3. Q cannot receive more “earth” than it can hold
4. As much “earth” as possible must be moved
[Figures: P with m clusters, Q with n clusters, and their counterparts P’ and Q’]
The Formal Definition of EMD
• Earth Mover’s Distance (EMD)
–the minimum amount of work needed to change one
histogram into another
• Challenge of EMD
–O(N^3logN)
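For reference, the linear program behind these slides in its commonly used form. The symbols fij (flow from bin i of P to bin j of Q), dij (ground distance) and pi, qj (bin weights) are notation assumed here, since the slide formulas did not survive:
  minimise    Σi Σj dij * fij
  subject to  fij ≥ 0                              (constraint 1)
              Σj fij ≤ pi  for every i             (constraint 2)
              Σi fij ≤ qj  for every j             (constraint 3)
              Σi Σj fij = min( Σi pi , Σj qj )     (constraint 4)
  EMD(P,Q) = ( Σi Σj dij * fij ) / ( Σi Σj fij ) at the optimum
Small worked example (illustrative numbers; three bins at positions 0, 1, 2 with dij = |i - j|):
  P = (1, 0, 0), Q = (0, 1, 0):   move 1 unit over distance 1, EMD = 1
  P = (1, 0, 0), Q’ = (0, 0, 1):  move 1 unit over distance 2, EMD = 2
The bin-by-bin L1 distance is 2 in both cases, so only the cross-bin measure sees that Q is the smaller shift.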
Related Works
• Filter-and-refine framework
–[1] Approximation Techniques for Indexing the Earth Mover's Distance in Multimedia Databases. ICDE 2006
• Cannot handle high dimensional histograms
–[2] Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction. SIGMOD 2008
• Based on the scan framework, which limits scalability
• Use scanning scheme to process queries
–Merit: can obtain a good access order when executing k-NN queries and thus can minimize the number of candidates
–Demerit: needs to scan the whole dataset to obtain the order, hence low scalability of the algorithm
Related Works
• Related works
–Based on the filter-and-refine framework
–Based on scanning method and low scalability
• Our work
–Also based on the filter-and-refine method
–But avoids scanning the whole data set
• Use B+ trees
• And thus can obtain high scalability
• Our contributions
–To the best of our knowledge, the 1st paper to index high dimensional prob. data based on the EMD
–Proposed algorithms for processing similarity queries based on B+ tree filter
–Improve the efficiency and scalability of EMD-based similarity search
Indexing the probabilistic data based on EMD
• Our intuition:
–primal-dual theory in linear programming
• Primal problem (EMD)
• Dual problem
Indexing the probabilistic data based on EMD
• Good properties of dual space
–Constraints of dual space are independent of prob. data points (i.e., p and q in this example)
• Thus, given any feasible solution (π, Ф) in dual space, we can derive a lower bound for EMD(p, q)
• The lower bound can help to filter out the not-hit histograms.
–Given any feasible solution (π, Ф) in dual space, a histogram p can be mapped as a value, using the operation of
• Can index histograms using B+ tree
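Why any feasible dual solution gives a lower bound, sketched here with the standard dual of the transportation problem. The constraint Фi + πj ≤ dij and the sums below are assumptions of this sketch, because the slide formulas were lost; p and q are taken to have total mass 1:
  primal: minimise Σi Σj dij * fij  subject to Σj fij = pi, Σi fij = qj, fij ≥ 0
  dual:   maximise Σi Фi * pi + Σj πj * qj  subject to Фi + πj ≤ dij for all i, j
  weak duality: Σi Фi * pi + Σj πj * qj ≤ EMD(p, q) for every feasible (π, Ф)
The dual constraints mention only dij, not p or q, so one feasible (π, Ф) serves every histogram, and Σi Фi * pi is a single number per histogram that can be stored in a B+ tree.
Example (three bins at positions 0, 1, 2, dij = |i - j|, Ф = (0, 1, 2), π = (0, -1, -2), feasible since Фi + πj = i - j ≤ |i - j|):
  p = (0, 0, 1), q = (1, 0, 0):  bound = 2 + 0 = 2, equal to EMD(p, q) = 2
  p = (1, 0, 0), q = (0, 0, 1):  bound = 0 - 2 = -2, a valid but loose bound on EMD(p, q) = 2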
Indexing the probabilistic data based on EMD
• 1. Mapping Construction
–Key and counter key
Key
Counter key
–Assuming p is a histogram in DB, given a feasible solution
(π, Ф), we calculate the Key for each record in DB
–We can index those keys using B+ tree
–For each feasible solution (π, Ф), a B+ tree can be
constructed
• Range query based on B+ index
–Given any feasible solution (π, Ф) , we construct a B+ tree
using keys of histograms
–Given a query histogram, we calculate its counter key using
the operation of
–Given a similarity search threshold θ, we have proved that
all candidate histogram’s key can be bounded by
–To further filter the candidates, we use L B+ trees and intersect their candidate results
• k-NN query based on B+ index
–Given a query q, we issue a search on each B+ tree Tl with key(q, Фl)
–We create two cursors for each tree and let them fetch records in different directions (one left and one right)
–Whenever a record r has been accessed by all B+ trees, it can be output as a candidate for the k-NN query
Experimental Setup
• 3 real data sets
–RETINA1
• an image data set consisting of 3932 feline retina scans labeled with various antibodies.
–IRMA
• contains 10000 radiography images from the Image Retrieval in Medical Application (IRMA) project
–DBLP
• With parameter setting
Experimental Results on Query CPU Time
Experimental Results on Scalability
[Plots not recovered; legend entries: sigmod, our]
Conclusions
• We present a new indexing scheme for the general
purposes of similarity search on Earth Mover's Distance
• Our index method relies on the primal-dual theory to
construct mapping functions from the original
probabilistic space to one-dimensional domain
• Our B+ tree-based index framework has
–High scalability
–High efficiency
–Can handle high dimensional data
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
A Microarray Dataset
1000 - 100,000 columns
Class
100500
rows
Gene1
Sample1
Cancer
Sample2
Cancer
Gene2
Gene3
Gene4
Gene
5
Gene
6
.
.
.
SampleN-1
~Cance
r
SampleN
~Cance
r
• Find closed patterns which occur frequently among genes.
• Find rules which associate certain combination of the
columns that affect the class of the rows
–Gene1,Gene10,Gene1001 -> Cancer
Challenge I
• Large number of patterns/rules
–number of possible column combinations is extremely high
• Solution: Concept of a closed pattern
–Patterns that are found in exactly the same set of rows are grouped together
and represented by their upper bound
• Example: the following patterns are found in rows 2, 3 and 4
upper bound (closed pattern): aeh
other patterns in the group: ae, ah, eh
lower bounds: e, h
“a”, however, is not part of the group (it also occurs in row 1)
i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C
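# A minimal Python sketch (not part of the original slides) of the closed
# pattern idea above, using the five-row example table. The helper names
# rows_of and closure are hypothetical and chosen only for illustration.
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def rows_of(pattern):
    """Return the ids of the rows that contain every item of the pattern."""
    return {i for i, items in TABLE.items() if set(pattern) <= items}

def closure(pattern):
    """Upper bound (closed pattern): the items shared by all supporting rows."""
    rows = rows_of(pattern)
    if not rows:
        return set()
    return set.intersection(*(TABLE[i] for i in rows))

# e, h, ae, ah, eh and aeh are all supported by the same rows, so they share
# one upper bound; a is supported by a larger row set and falls outside.
print(sorted(rows_of("e")), "".join(sorted(closure("e"))))
print(sorted(rows_of("a")), "".join(sorted(closure("a"))))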
Challenge II
• Most existing frequent pattern discovery algorithms perform
searches in the column/item enumeration space i.e. systematically
testing various combination of columns/items
• For datasets with 1000-100,000 columns, this search space is
enormous
this purpose. CARPENTER (SIGKDD’03) is the FIRST
Column/Item Enumeration Lattice
• Each node in the lattice represents
a combination of columns/items
• An edge exists from node A to B if
A is a subset of B and A differs from
B by only 1 column/item
• Search can be done depth first
• Keep edges from parent to child
only if the parent is a prefix of the child
i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C
Lattice nodes shown in the figure, from the start node upwards:
{} (start)
a   b   c
a,b   a,c   a,e   b,c
a,b,c   a,b,e   a,c,e
a,b,c,e
General Framework for Column/Item Enumeration
(column headings recovered from the table: Write-based, Point-based)
Association Mining: Apriori [AgSr94], DIC, Eclat, MaxClique [Zaki01], FPGrowth [HaPe00], Hmine
Sequential Pattern Discovery: GSP [AgSr96], [Zaki98, Zaki01], PrefixSpan [PHPC01]
Iceberg Cube: Apriori [AgSr94], BUC [BeRa99], HCubing [HPDW01]
A Multidimensional View
[Figure: axes and their labels]
• types of data or knowledge: associative pattern, sequential pattern, iceberg cube, others
• pruning method: constraints, other interest measure
• compression method: closed/max pattern
• lattice traversal/main operations: write, point
Sample/Row Enumeration Algorithms
• To avoid searching the large column/item enumeration space, our
mining algorithms search for patterns/rules in the sample/row
enumeration space
• Our algorithms do not fit into the column/item enumeration
algorithms
• They are not YAARMA (Yet Another Association Rules Mining
Algorithm)
• Column/item enumeration algorithms simply do not scale for
microarray datasets
Existing Row/Sample Enumeration Algorithms
• CARPENTER(SIGKDD'03)
–Find closed patterns using row enumeration
• FARMER(SIGMOD’04)
–Find interesting rule groups and building classifiers based on them
• COBBLER(SSDBM'04)
–Combined row and column enumeration for tables with large
number of rows and columns
• Topk-IRG(SIGMOD’05)
–Find top-k covering rules for each sample and build classifier
directly
• Efficiently Finding Lower Bound Rules(TKDE’2010)
–Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao.
What is Unequal among the Equals? Ranking Equivalent Rules from
Gene Expression Data. Accepted in TKDE
Concepts of CARPENTER
Example Table
i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C
Transposed Table, TT
ij   R(ij): C   R(ij): ~C
a    1,2,3      4
b    1          5
c    1,3        -
d    2          5
e    2,3        4
f    -          4,5
g    -          5
h    2,3        4
l    1,2        5
o    1,3        -
p    2          4
q    3          5
r    2          4
s    1          5
t    3          5
TT|{2,3}
ij   R(ij): C   R(ij): ~C
a    1,2,3      4
e    2,3        4
h    2,3        4
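# A minimal Python sketch (not part of the original slides) of building the
# transposed table TT and a conditional table TT|X for the example above.
# The names transpose and conditional are hypothetical. Rows 1-3 belong to
# class C and rows 4-5 to ~C; the sketch keeps one row list per item and
# does not split it by class.
TABLE = {
    1: ("abclos", "C"),
    2: ("adehplr", "C"),
    3: ("acehoqt", "C"),
    4: ("aefhpr", "~C"),
    5: ("bdfglqst", "~C"),
}

def transpose(table):
    """Map each item ij to R(ij), the sorted list of row ids that contain it."""
    tt = {}
    for rid in sorted(table):
        for item in table[rid][0]:
            tt.setdefault(item, []).append(rid)
    return tt

def conditional(tt, rows):
    """TT|rows: keep only the items whose row list contains every given row."""
    return {item: rids for item, rids in tt.items() if set(rows) <= set(rids)}

tt = transpose(TABLE)
print(conditional(tt, {2, 3}))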
Row Enumeration
Row enumeration tree (each node lists a set of rows and the items common to all of them):
1 {abclos}
  12 {al}
    123 {a}
      1234 {a}
        12345 {}
      1235 {}
    124 {a}
      1245 {}
    125 {l}
  13 {aco}
    134 {a}
      1345 {}
    135 {}
  14 {a}
    145 {}
  15 {bls}
2 {adehlpr}
  23 {aeh}
    234 {aeh}
      2345 {}
    235 {}
  24 {aehpr}
    245 {}
  25 {dl}
3 {acehoqt}
  34 {aeh}
    345 {}
  35 {qt}
4 {aefhpr}
  45 {f}
5 {bdfglqst}
Conditional transposed tables (TT itself is the full transposed table from Concepts of CARPENTER):
TT|{1}
ij   R(ij): C   R(ij): ~C
a    1,2,3      4
b    1          5
c    1,3        -
l    1,2        5
o    1,3        -
s    1          5
TT|{12}
ij   R(ij): C   R(ij): ~C
a    1,2,3      4
l    1,2        5
TT|{124} {123}
ij   R(ij): C   R(ij): ~C
a    1,2,3      4
Pruning Method 1
• Removing rows that appear in all tuples
of the transposed table will not affect results
r2 r3 {aeh}
r2 r3 r4 {aeh}
TT|{2,3}
ij   C      ~C
a    1,2,3  4
e    2,3    4
h    2,3    4
r4 has 100% support in the conditional table of
“r2 r3”, therefore branch “r2 r3 r4” will be
pruned.
Pruning method 2
• if a rule has been discovered before, we can prune
enumeration below this node
–Because all rules below this node have been
discovered before
–For example, at node 34, if we find that {aeh} has
already been found, we can prune off all branches below it
TT|{3,4}
ij   C      ~C
a    1,2,3  4
e    2,3    4
h    2,3    4
[Figure: the row enumeration tree from the Row Enumeration slide]
Pruning Method 3: Minimum Support
• Example: From TT|{1}, we can see
that the support of all possible
patterns below node {1} will be at
most 5 rows.
TT|{1}
ij   R(ij): C   R(ij): ~C
a    1,2,3      4
b    1          5
c    1,3        -
l    1,2        5
o    1,3        -
s    1          5
From CARPENTER to FARMER
• What if classes exist? What more can we
do?
• Pruning with Interestingness Measure
–Minimum confidence
–Minimum chi-square
• Generate lower bounds for classification/
prediction
Interesting Rule Groups
• Concept of a rule group/equivalent class
–rules supported by exactly the same set of rows are grouped together
• Example: the following rules are derived from rows 2, 3 and 4 with
66% confidence
upper bound: aeh --> C (66%)
ae --> C (66%)
ah --> C (66%)
eh --> C (66%)
lower bounds: e --> C (66%), h --> C (66%)
a --> C, however, is not in the group
i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C
Pruning by Interestingness Measure
• In addition, find only interesting rule groups (IRGs) based
on some measures:
–minconf: the rules in the rule group can predict the class on
the RHS with high confidence
–minchi: there is high correlation between LHS and RHS of
the rules based on chi-square test
• Other measures like lift, entropy gain, conviction etc. can
be handled similarly
Ordering of Rows: All Class C before ~C
[Figure: the row enumeration tree and the conditional tables TT|{1}, TT|{12} and TT|{124} from the Row Enumeration slide, with the class C rows (1, 2, 3) ordered before the ~C rows (4, 5)]
Pruning Method: Minimum Confidence
• Example: In TT|{2,3} below,
the maximum confidence of all rules
below node {2,3} is at most 4/5
TT|{2,3}
ij   C        ~C
a    1,2,3,6  4,5
e    2,3,7    4,9
h    2,3      4
Pruning method: Minimum chi-square
• Same as in computing maximum confidence
        C             ~C        Total
A       max=5 min=1   Computed
~A      Computed      Computed  Computed
Total   Constant      Constant  Constant
TT|{2,3}
ij   C        ~C
a    1,2,3,6  4,5
e    2,3,7    4,9
h    2,3      4
Finding Lower Bound, MineLB
–Example: An upper bound
rule with antecedent A=abcde
and two rows (r1 : abcf ) and
(r2 : cdeg)
–Initialize lower bounds {a, b,
c, d, e}
{d, e}
Candidate lower bounds: ad, ae, bd, be, cd, ce
Removed, since d and e are still lower bounds
Candidate lower bounds: ad, ae, bd, be
Kept, since no lower bound overrides them
[Figure: lattice below a,b,c,d,e showing abc, cde, bd, be and the single items a, b, c, d, e]
Implementation
• In general, CARPENTER and FARMER can be
implemented in many ways:
–FP-tree
–Vertical format
• For our case, we assume the dataset can
fit into main memory and use a
pointer-based algorithm similar to BUC
Experimental studies
• Efficiency of FARMER
–On five real-life datasets
• lung cancer (LC), breast cancer (BC) , prostate cancer (PC), ALLAML leukemia (ALL), Colon Tumor(CT)
–Varying minsup, minconf, minchi
–Benchmark against
• CHARM [ZaHs02] ICDM'02
• Bayardo’s algorithm (ColumE) [BaAg99] SIGKDD'99
• Usefulness of IRGs
–Classification
Example results--Prostate
[Figure: curves for FARMER, ColumnE and CHARM; x-axis: minimum support (3 to 9); y-axis: log scale, 1 to 100000]
Example results--Prostate
[Figure: curves for FARMER:minsup=1:minchi=10 and FARMER:minsup=1; x-axis: minimum confidence(%) (0 to 99); y-axis: 0 to 1200]
Top k Covering Rule Groups
• Rank rule groups (upper bound) according to
– Confidence
– Support
• Top k Covering Rule Groups for row r
– k highest ranking rule groups that have row r as support and support
> minimum support
• Top k Covering Rule Groups =
TopKRGS for each row
Usefulness of Rule Groups
• Rules for every row
• Top-1 covering rule groups sufficient to build CBA classifier
• No min confidence threshold, only min support
• #TopKRGS = k x #rows
Top-k covering rule groups
• For each row, we find the most
significant k rule groups:
–based on confidence first
–then support
• Given minsup=1, Top-1
–row 1: abc -> C1 (sup=2, conf=100%)
–row 2: abc -> C1
• abcd -> C1 (sup=1, conf=100%)
–row 3: cd -> C1 (sup=2, conf=66.7%)
• If minconf = 80%, ?
–row 4: cde -> C2 (sup=1, conf=50%)
row  class  Items
1    C1     a,b,c
2    C1     a,b,c,d
3    C1     c,d,e
4    C2     c,d,e
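# A minimal Python sketch (not part of the original slides) that recomputes
# the support and confidence figures quoted above from the four-row table.
# The function name rule_stats is hypothetical.
ROWS = [
    ("C1", set("abc")),
    ("C1", set("abcd")),
    ("C1", set("cde")),
    ("C2", set("cde")),
]

def rule_stats(lhs, cls):
    """Return (support, confidence) of the rule lhs -> cls.

    Support counts the rows of class cls that contain lhs; confidence divides
    that count by the number of rows of any class that contain lhs.
    """
    covered = [c for c, items in ROWS if set(lhs) <= items]
    sup = sum(1 for c in covered if c == cls)
    conf = sup / len(covered) if covered else 0.0
    return sup, conf

for lhs, cls in [("abc", "C1"), ("abcd", "C1"), ("cd", "C1"), ("cde", "C2")]:
    print(lhs, "->", cls, rule_stats(lhs, cls))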
Main advantages of Top-k coverage rule group
• The number is bounded by the product of k and the number
of samples
• Treat each sample equally -> provide a complete description
for each row (small)
• No minimum confidence parameter is needed; k is used instead.
• Sufficient to build classifiers while avoiding excessive
computation
Top-k pruning
• At node X, the maximal set of rows covered by rules to
be discovered below X: rows containing X and rows
ordered after X.
– minconf <- MIN confidence of the discovered TopkRGs for all rows in the above
set
– minsup <- the corresponding minsup
• Pruning
–If the estimated upper bound of confidence below X < minconf -> prune
–If same confidence and smaller support -> prune
• Optimizations
Classification based on association rules
• Step 1: Generate the complete set of association rules
for each class (subject to minimum support and minimum
confidence).
–The CBA algorithm adopts an Apriori-like algorithm, which fails at this step on microarray
data.
• Step 2: Sort the set of generated rules
• Step 3: Select a subset of rules from the sorted rule
sets to form classifiers.
Features of RCBT classifiers
Problem: To discover, store, retrieve and sort a large number of rules
RCBT: Mine those rules to be used for classification, e.g. the Top-1 rule group is sufficient to build a CBA classifier
Problem: Default class not convincing for biologists
RCBT: Main classifier + some back-up classifiers
Problem: Rules with the same discriminating ability, how to integrate? (Upper bound rules: specific; lower bound rules: general)
RCBT: A subset of lower bound rules, integrated using a score considering both confidence and support.
Experimental studies
• Datasets: 4 real-life datasets
• Efficiency of Top-k Rule mining
–Benchmark: Farmer, Charm, Closet+
• Classification Methods:
–CBA (build using top-1 rule group)
–RCBT (our proposed method)
–IRG Classifier
–Decision trees (single, bagging, boosting)
–SVM
Runtime vs. Minimum support on ALL-AML dataset
[Figure: Runtime(s), log scale from 0.01 to 10000, vs. Minimum Support (17 to 25); curves: FARMER, FARMER(minconf=0.9), FARMER+prefix(minconf=0.9), TOP1, TOP100]
Scalability with k
[Figure: Runtime(s), log scale from 0.1 to 100, vs. k (100 to 1000); curves: PC, ALL]
Biological meaning –Prostate Cancer Data
[Figure: Frequency of Occurrence (0 to 1800) vs. Gene Rank (0 to 1600); labeled genes: W72186, AF017418, AI635895, X14487, AB014519, M61916, Y13323]
Classification results
[Tables not recovered]
References
• Anthony K. H. Tung, Rui Zhang, Nick Koudas, Beng Chin Ooi. "Similarity Search:
A Matching Based Approach", VLDB'06
• H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, Anthony K. H. Tung, "ItCompress:
An Iterative Semantic Compression Algorithm". International Conference on Data
Engineering (ICDE'2004), Boston, 2004.
• Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung.
Similarity Search on Bregman Divergence: Towards Non-Metric Indexing. In the
Proceedings of the 35th International Conference on Very Large Data Bases (VLDB),
Lyon, France August 24-28, 2009.
• Jia Xu, Zhenjie Zhang, Anthony K. H. Tung, and Ge Yu. "Efficient and Effective
Similarity Search over Probabilistic Data Based on Earth Mover's Distance". to
appear in VLDB 2010, a preliminary version on Technical Report TRA5-10,
National University of Singapore.
• Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering
Rule Groups for Gene Expression Data". In Proceedings SIGMOD'05, Baltimore,
Maryland 2005
• Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. What is Unequal
among the Equals? Ranking Equivalent Rules from Gene Expression Data.
Accepted in TKDE
Optional References:
• Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki,
"CARPENTER: Finding Closed Patterns in Long Biological Datasets",
In Proceedings KDD'03, Washington, DC, USA, August 24-27, 2003.
• Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang.
"FARMER: Finding Interesting Rule Groups in Microarray Datasets".
In SIGMOD'04, June 13-18, 2004, Maison de la Chimie, Paris, France.
• Feng Pan, Anthony K. H. Tung, Gao Cong, Xin Xu. "COBBLER:
Combining Column and Row Enumeration for Closed Pattern
Discovery". SSDBM 2004 Santorini Island Greece.
• Gao Cong, Kian-Lee Tan, Anthony K.H. Tung, Feng Pan. "Mining
Frequent Closed Patterns in Microarray Data". In IEEE International
Conference on Data Mining, (ICDM). 2004
• Xin Xu, Ying Lu, Anthony K.H. Tung, Wei Wang. "Mining Shifting-and-Scaling
Co-Regulation Patterns on Gene Expression Profiles". ICDE 2006.
Chapter 3 Similarity Search on Sequences
Searching and Mining Complex
Structures
Similarity Search on Sequences
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Types of sequences
Symbolic vs Numeric
We only touch on discrete symbols here. Sequences of numbers are called time
series, which is a huge topic by itself!
Single dimension vs multi-dimensional
Example: Yueguo Chen, Shouxu Jiang, Beng Chin Ooi, Anthony K. H. Tung.
"Querying Complex Spatial-Temporal Sequences in Human Motion Databases"
accepted and to appear in 24th IEEE International Conference on Data Engineering
(ICDE) 2008
Single long sequence vs multiple sequences
2010-7-31
Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)
Suffix
Suffixes of acacag\$:
1. acacag\$
2. cacag\$
3. acag\$
4. cag\$
5. ag\$
6. g\$
7. \$
Suffix Trie
E.g. consider the string S = acacag\$
Suffix Trie: a trie of
all possible suffixes of S
   Suffix
1  acacag\$
2  cacag\$
3  acag\$
4  cag\$
5  ag\$
6  g\$
7  \$
[Figure: suffix trie of S; each leaf is labeled with the start position (1 to 7) of its suffix]
Suffix Tree (I)
Suffix tree for S=acacag\$: merge nodes with only one child
    1 2 3 4 5 6 7
S = a c a c a g \$
[Figure: suffix tree of S; leaves are labeled with suffix start positions 1 to 7]
Path-label of node v is “aca”, denoted as α(v)
“ca” is an edge label
This is a leaf edge
Suffix Tree (II)
Suffix tree has exactly n leaves and fewer than 2n edges
The label of each edge can be represented using 2 indices
Thus, suffix tree can be represented using O(n log n) bits
    1 2 3 4 5 6 7
S = a c a c a g \$
[Figure: suffix tree of S with each edge label replaced by a pair of indices into S, e.g. 1,1 for “a”, 2,3 for “ca”, 4,7 for “cag\$”, 6,7 for “g\$” and 7,7 for “\$”]
Note: The end index of every
leaf edge should be 7, the last
index of S. Thus, for leaf edges,
we only need to store the start
index.
Generalized suffix tree
Build a suffix tree for two or more strings
E.g. S1 = acgat#, S2 = cgt\$
[Figure: generalized suffix tree of S1 and S2; leaves are labeled with suffix start positions 1 to 6 for S1 and 1 to 4 for S2]
Straightforward construction of suffix tree
Consider S = s1s2…sn where sn=\$
Algorithm:
Initialize the tree with only a root
For i = n to 1
Insert S[i..n] into the tree
Time: O(n^2)
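# A minimal Python sketch (not part of the original slides) of the
# straightforward O(n^2) construction above. For simplicity it builds an
# uncompressed suffix trie as nested dicts rather than a suffix tree with
# merged edges. The names build_suffix_trie, contains and LEAF are hypothetical.
LEAF = "leaf"  # multi-character key, so it cannot clash with a single symbol

def build_suffix_trie(s):
    """Insert the suffixes S[i..n] for i = n down to 1 (1-based, as above)."""
    root = {}
    for i in range(len(s), 0, -1):
        node = root
        for ch in s[i - 1:]:
            node = node.setdefault(ch, {})
        node[LEAF] = i  # remember where this suffix starts
    return root

def contains(trie, pattern):
    """A pattern occurs in S exactly when it spells a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("acacag$")
print(contains(trie, "cac"), contains(trie, "gc"))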
Example of construction
S=acca\$
[Figure: the tree after Init and after each For-loop step I5, I4, I3, I2, I1, which insert the suffixes starting at positions 5, 4, 3, 2 and 1]
Construction of generalized suffix tree
S’= c#
[Figure: starting from the tree of S (Init), the For-loop steps J2 and J1 insert the suffixes of S’]
Property of suffix tree
Fact: For any internal node v in the suffix tree, if
the path label of v is α(v)=ap, then
there exists another node w in the suffix tree such that
α(w)=p.
Proof: Skip the proof.
For any internal node v, define its suffix link sl(v) = w.
S=acacag\$
[Figure: suffix tree of S; leaves are labeled with suffix start positions 1 to 7]
Can we construct a suffix tree in O(n)
time?
Yes. We can construct it in O(n) time and O(n) space
Weiner’s algorithm [1973]
Linear time for constant size alphabet, but much space
McCreight’s algorithm [JACM 1976]
Linear time for constant size alphabet, quadratic space
Ukkonen’s algorithm [Algorithmica, 1995]
Online algorithm, linear time for constant size alphabet, less space
Farach’s algorithm [FOCS 1997]
Linear time for general alphabet
Hon, Sadakane, and Sung’s algorithm [FOCS 2003]
O(n) bit space O(n log^e n) time for 0<e<1
O(n) bit space O(n) time for suffix array construction
But they are all in-memory algorithms that do not
guarantee locality of processing
Trellis Algorithm
A novel disk-based suffix tree construction
algorithm designed specifically for DNA
sequences
Scales gracefully for very large genome sequences
(i.e. human genome)
Unlike existing algorithms,
Trellis exhibits no data skew problem
Trellis has fast construction and query time
Trellis is a 4-step algorithm
Trellis: Algorithm Overview
1. Variable-length prefixes: e.g. AA, ACA, ACC, …
2. Prefixed Suffix Sub-trees: S is split into partitions R0, R1, …, Rr-1; the suffix trees TR0, TR1, …, TRr-1 are split by prefix into sub-trees TR0,P0, …, TR1,Pm-1, … and stored on disk
3. Tree Merging: the sub-trees TR0,Pi, …, TRr-1,Pi are merged into TPi
1. Variable-length Prefix Creation
Goal: Separate the complete suffix tree by prefixes of
suffixes, such that each subtree can reside entirely in the
available memory
Main Idea: Expand prefixes only as needed
[Figure: Frequency of Length-2 Prefixes for Human Genome; x-axis: Prefixes (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT); y-axis: Frequency, 0 to 300,000,000]
2. Suffix Tree Partitioning
[Figure: the algorithm overview diagram, with S split into partitions R0, R1, …, Rr-1 and the prefixed suffix sub-trees TR0,P0, …, TR1,Pm-1, … written to disk]
• Use Ukkonen’s method because
of its efficiency: O(n) time & space
subtrees on disk
• Store enough information so that a
subtree can be rebuilt quickly, e.g. edge
starting index, edge length, node parent,
etc.
3. Suffix Tree Merging
[Figure: the algorithm overview diagram; in step 3 the sub-trees TR0,Pi, …, TRr-1,Pi are merged into TPi]
Merge Algorithm
Case 1: No common prefix
[Figure: the edges of T1 and T2 (A, C, G, T) start with different symbols, so they are placed side by side under the merged node]
Case 2: Has common prefix
[Figure: edge CAAT of T1 and edge CAGGC of T2 share the prefix CA; the merged tree has an edge CA with children AT and GGC]
Some internal nodes have suffix links from the
Ukkonen’s algorithm in Step #1
Some internal nodes are created in the merging
step and do not have suffix links
and stored suffix trees on disk (does not help
speed this step up, so discard to simplify)
Should suffix links be required, use the suffix
link recovery algorithm to rebuild them
For each prefixed suffix tree, recursively call this function
from the tree’s root.
x: an internal node
L: the edge label between x and parent(x)
RECOVER(x, L)
  if (x == root) sl(x) = x;
  else {
    1. p = parent(x);
    2. q = sl(p); // get suffix link of p, and load the prefix tree
                  // for q from disk if not in memory
    3. Skip/count using L to locate sl(x) under q; }
  for (each internal child y of x)
    RECOVER(y, edge-label(x,y));
Experimental Results
Construction Time
[Figure: Trellis vs TOP-Q and DynaCluster; Time (mins) vs Sequence Length (Mbp)]
[Figure: Trellis vs TDD; Time (mins) vs Sequence Length (Mbp)]
• Memory: 512MB
• Memory: 512 MB
• TOP-Q and DynaCluster parameters were
set as recommended in their papers
Human genome suffix tree
(size ~3Gbp, using 2GB of memory)
Trellis
TDD: 12.8hr
• Without
5.9hr
Experimental Results (cont.)
Disk Space Usage
[Figure: Disk-based Suffix Tree Size, Trellis vs TDD; Size (GB), 0 to 30, vs Sequence Length (Mbp), 200 to 1000]
Human Genome: Trellis 72GB, TDD 54GB
27 bytes per character indexed while
For the human genome, TDD uses
requires 64-bit environment to index
larger sequences.
Trellis remains at 27 bytes/char for
the human genome.
Experimental Results (cont.)
Trellis vs TDD
Query Times on the Human Genome Suffix Tree
[Figure: Query Length (bp), 40 to 8000, vs Query Time (secs), 0.000 to 0.200, for Trellis and TDD]
TDD
• smaller suffix trees
• edge length must be determined
by examining all children nodes
• each internal node only has a
pointer to its first child, i.e. children
must be linearly scanned during
a query search
Trellis
• larger suffix trees
• edge length stored locally with its
respective node
• all children locations stored locally,
so each child can be accessed in a
constant time, i.e. no linear scan
needed
Hence, faster query time!
Experimental Results
(cont.)
Query length = 100
• Uses suffix links to move
across the tree to search for
the next query
• Mimics the behavior of
exact match anchor search
during a genome alignment
[Figure: a node v with path label xα, and sf(v) with path label α]
Experiment Results (cont.)
Query Times on the Human Genome Suffix Tree
[Figure: Query Length (bp), 40 to 8000, vs Query Time (secs), 0.000 to 0.050]
Summary
Trellis builds a disk-based suffix tree based on
A partitioning method via variable-length prefixes
A suffix subtree merging algorithm
Trellis is both time and space efficient
Faster than existing leading methods in both
construction and query time
Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)
Example 1: a movie database
Find movies starring Samuel Jackson
Star            Title                                         Year  Genre
Keanu Reeves    The Matrix                                    1999  Sci-Fi
Samuel Jackson  Star Wars: Episode III - Revenge of the Sith  2005  Sci-Fi
Schwarzenegger  The Terminator                                1984  Sci-Fi
Samuel Jackson  Goodfellas                                    1990  Drama
…               …                                             …     …
The user doesn’t know the exact spelling!
Relax Condition
Find movies with a star “similar to” Schwarrzenger.
Edit Distance
Given two strings A and B, edit A to B with the
minimum number of edit operations:
Replace a letter with another letter
Insert a letter
Delete a letter
E.g.
A = interestings
B = bioinformatics
_i__nterestings
bioinformatic_s
101101101100110
Edit distance = 9
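# A minimal Python sketch (not part of the original slides) of computing the
# edit distance defined above, with unit cost for replace, insert and delete.
# The function name edit_distance is hypothetical.
def edit_distance(a, b):
    """Return the minimum number of single-letter edits that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),  # replace, or free match
                            prev[j] + 1,               # delete ca
                            curr[j - 1] + 1))          # insert cb
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))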
Edit Distance Computation
Instead of minimizing the number of edit operations, we
can associate a cost function to the operations and
minimize the total cost. Such cost is called edit distance.
For the previous example, the cost function is as follows:
δ   _  A  C  G  T
_      1  1  1  1
A   1  0  1  1  1
C   1  1  0  1  1
G   1  1  1  0  1
T   1  1  1  1  0
A= _i__nterestings
B= bioinformatic_s
   101101101100110
Edit distance = 9
Needleman-Wunsch algorithm (I)
Consider two strings S[1..n] and T[1..m].
Define V(i, j) to be the score of the optimal
alignment between S[1..i] and T[1..j]
Basis:
V(0, 0) = 0
V(0, j) = V(0, j-1) + δ(_, T[j])    (insert j times)
V(i, 0) = V(i-1, 0) + δ(S[i], _)    (delete i times)
Needleman-Wunsch algorithm (II)
Recurrence: For i>0, j>0
V(i, j) = max { V(i-1, j-1) + δ(S[i], T[j])    (match/mismatch)
                V(i-1, j) + δ(S[i], _)         (delete)
                V(i, j-1) + δ(_, T[j])         (insert) }
In the alignment, the last pair must be either
match/mismatch, delete, insert.
match/mismatch:  xxx…xx
                 yyy…yy
delete:          xxx…xx
                 yyy…y_
insert:          xxx…x_
                 yyy…yy
Example (I)
Initial row and column for S = ACAATCC (rows) and T = AGCATGC (columns):
      _   A   G   C   A   T   G   C
_     0  -1  -2  -3  -4  -5  -6  -7
A    -1
C    -2
A    -3
A    -4
T    -5
C    -6
C    -7
Example (II)
Completed table (the values are consistent with a score of 2 for a match and -1 for a mismatch, insertion or deletion):
      _   A   G   C   A   T   G   C
_     0  -1  -2  -3  -4  -5  -6  -7
A    -1   2   1   0  -1  -2  -3  -4
C    -2   1   1   3   2   1   0  -1
A    -3   0   0   2   5   4   3   2
A    -4  -1  -1   1   4   4   3   2
T    -5  -2  -2   0   3   6   5   4
C    -6  -3  -3   0   2   5   5   7
C    -7  -4  -4  -1   1   4   4   7
Example (III)
[Figure not recovered]
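# A minimal Python sketch (not part of the original slides) of the
# Needleman-Wunsch recurrence above. The scores match=2, mismatch=-1 and
# gap=-1 are an assumption inferred from the numbers in the example table;
# the function name needleman_wunsch is hypothetical.
def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-1):
    """Return the table V, where V[i][j] scores aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap          # delete i times
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap          # insert j times
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + delta,  # match/mismatch
                          V[i - 1][j] + gap,        # delete
                          V[i][j - 1] + gap)        # insert
    return V

V = needleman_wunsch("ACAATCC", "AGCATGC")
print(V[-1][-1])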
“q-grams” of strings
universal
2-grams: un, ni, iv, ve, er, rs, sa, al
q-gram inverted lists
id  strings
0   rich
1   stick
2   stich
3   stuck
4   static
2-grams  ids
at       4
ch       0, 2
ck       1, 3
ic       0, 1, 2, 4
ri       0
st       1, 2, 3, 4
ta       4
ti       1, 2, 4
tu       3
uc       3
Searching using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
2-grams of the query: sh ht ti ic ck
# of common grams >= 3
[Figure: the strings and 2-gram inverted lists from the q-gram inverted lists slide]
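# A minimal Python sketch (not part of the original slides) of the count
# filter above: a string within edit distance k of the query must share at
# least (|query| - q + 1) - k*q of the query's q-grams. Surviving ids are only
# candidates and still need an exact edit distance check. All function names
# here are hypothetical.
from collections import Counter, defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]

def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q):
    """Inverted lists: q-gram -> set of ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index

def candidates(query, index, q, k, num_strings):
    """Ids of strings sharing enough q-grams with the query."""
    grams = qgrams(query, q)
    threshold = len(grams) - k * q
    if threshold <= 0:
        return list(range(num_strings))  # the filter cannot prune anything
    counts = Counter()
    for g in grams:
        for sid in index.get(g, ()):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)

print(candidates("shtick", build_index(STRINGS, 2), 2, 1, len(STRINGS)))
print(candidates("shtick", build_index(STRINGS, 3), 3, 1, len(STRINGS)))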
2-grams -> 3-grams?
Query: “shtick”, ED(shtick, ?)≤1
3-grams of the query: sht hti tic ick
# of common grams >= 1
id  strings
0   rich
1   stick
2   stich
3   stuck
4   static
3-grams  ids
ati      4
ich      0, 2
ick      1
ric      0
sta      4
sti      1, 2
stu      3
tat      4
tic      1, 2, 4
tuc      3
uck      3
Observation 1: dilemma of choosing “q”
Increasing “q” causes:
Longer grams -> Shorter lists
Smaller # of common grams of similar strings
[Figure: the strings and 2-gram inverted lists from the q-gram inverted lists slide]
Observation 2: skew distributions of gram
frequencies
DBLP: 276,699 article titles
Popular 5-grams: ation (>114K times), tions, ystem, catio
VGRAM: Main idea
Grams with variable lengths (between qmin and qmax)
zebra
ze(123)
corrasion
co(5213), cor(859), corr(171)
Reduce index size ☺
Reducing running time ☺
Challenges
Generating variable-length grams?
Constructing a high-quality gram dictionary?
Relationship between string similarity and their
gram-set similarity?
2010-7-31
134
Mining and Searching Complex Structures
Chapter 3 Similarity Search on Sequences
Challenge 1: String -> Variable-length grams?
Fixed-length 2-grams
universal
Variable-length grams
[2,4]-gram dictionary: ni, ivr, sal, uni, vers
universal
Representing gram dictionary as a trie
[Figure: trie over the dictionary grams ni, ivr, sal, uni, vers]
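A possible trace of the decomposition of “universal” with the dictionary {ni, ivr, sal, uni, vers}, qmin = 2 and qmax = 4. This is only a sketch: it assumes a greedy rule that takes the longest dictionary gram starting at each position, falls back to the qmin-gram when no dictionary gram matches, and drops any gram that lies inside a gram already produced. The rule is stated here as an assumption in the spirit of the VGRAM paper, not quoted from the slides.
position 1: uni (in the dictionary)
position 2: ni (in the dictionary, but inside uni, dropped)
position 3: iv (no dictionary gram matches here, fall back to the 2-gram)
position 4: vers (in the dictionary)
positions 5, 6: er, rs (fallback 2-grams, inside vers, dropped)
position 7: sal (in the dictionary)
position 8: al (fallback 2-gram, inside sal, dropped)
Result: uni, iv, vers, sal, that is 4 grams instead of the 8 fixed-length 2-grams.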
Challenge 2: Constructing gram dictionary
Step 1: Collecting frequencies of grams with length in [qmin, qmax]
Gram trie with frequencies (gram: ids of the strings containing it)
st: 0, 1, 3
sti: 0, 1
stu: 3
stic: 0, 1
stuc: 3
Step 2: selecting grams
Pruning trie using a frequency threshold T (e.g., 2)
Final gram dictionary
[2,4]-grams
Challenge 3: Edit operation’s effect on grams
universal
Fixed length: q
k operations could affect k * q grams
Deletion affects variable-length grams
For a deletion at position i, grams lying entirely before position i-qmax+1 or entirely after position i+qmax-1 are not affected; grams within [i-qmax+1, i+qmax-1] may be affected.
Grams affected by a deletion
[2,4]-grams: ni, ivr, sal, uni, vers
String: universal
Which grams within [i-qmax+1, i+qmax-1] are affected?
Checked using a trie of grams and a trie of reversed grams
# of grams affected by each operation
Deletion/substitution (at a character), insertion (at a gap “_”)
0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
_ u _ n _ i _ v _ e _ r _ s _ a _ l _
Max # of grams affected by k operations
Vector of s = <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
Called NAG vector (# of affected grams)
Precomputed and stored
Summary of VGRAM index
Basic interfaces:
String s -> grams
Strings s1, s2 such that ed(s1,s2) <= k -> min # of their common grams
Lower bound on # of common grams
Fixed length (q)
universal
If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
Variable lengths: # of grams of s1 - NAG(s1,k)
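As a quick check of the fixed-length bound, assuming the query “shtick” (|s1| = 6) and k = 1 as in the examples of this chapter:
q = 2: (6 - 2 + 1) - 1 * 2 = 3 common grams
q = 3: (6 - 3 + 1) - 1 * 3 = 1 common gram
These values agree with the thresholds “# of common grams >= 3” and “# of common grams >= 1” used for the 2-gram and 3-gram searches.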
Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
Strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
With 2-grams: query grams sh, ht, ti, ic, ck; inverted lists …, ck, ic, …, ti, …; lower bound = 3
With 2-4 grams: query grams sh, ht, tick; inverted lists …, ck, ic, ich, …, tic, tick, …; lower bound = 1
PartEnum + VGRAM
PartEnum, fixed q-grams:
ed(s1,s2) <= k  =>  hamming(grams(s1),grams(s2)) <= k * q
VGRAM:
ed(s1,s2) <= k  =>  hamming(VG(s1),VG(s2)) <= NAG(s1,k) + NAG(s2,k)
PartEnum + VGRAM (naïve)
Join of R and S:
Bm(S) = max(NAG(s,k))
Bm(R) = max(NAG(r,k))
• Both are using the same gram dictionary.
• Use Bm(R) + Bm(S) as the new hamming bound.
PartEnum + VGRAM (optimization)
R is split into R1 with Bm(R1), R2 with Bm(R2), R3 with Bm(R3)
S with Bm(S) = max(NAG(s,k))
• Group R based on the NAG(r,k) values
• Join(R1,S) using Bm(R1) + Bm(S)
• Similarly, Join(R2,S), Join(R3,S)
• Local bounds tighter -> better signatures generated
• Grouping S also possible.
Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)
Approximate String Search
Information Retrieval
Web search query with string “Posgre SQL” instead of
“Postgre SQL”
Data Cleaning
“13 Computing Road” is the same as “#13 Comput’ng Rd”?
Bioinformatics
Find out all protein sequences similar to
“ACBCEEACCDECAAB”
Edit Distance
Edit distance on strings
13 Computing Drive
  -> (3 deletions) -> 13 Computing Dr
  -> (1 replacement) -> 13 Comput’ng Dr
  -> (1 insertion) -> #13 Comput’ng Dr
Edit distance: 5
Normalized edit distance
ED(s1,s2) / MaxLength(s1,s2) = 5 / 18
Existing Solution
Q-Gram
Q=3
Postgre: ##P #Po Pos ost stg tgr gre re# e##
Posgre: ##P #Po Pos osg sgr gre re# e##
Observation: If ED(s1,s2)=d, they agree on at least min(|s1|,|s2|)+Q-1-d*(Q+1) grams
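Checking the observation on the pair Postgre / Posgre, as a small worked example (the counts are derived here from the two strings with Q = 3 and “#” padding): |Postgre| = 7, |Posgre| = 6, d = ED(Postgre, Posgre) = 1
Bound: min(7, 6) + 3 - 1 - 1 * (3 + 1) = 4
Shared grams: ##P, #Po, Pos, gre, re#, e##, that is 6 grams, which is >= 4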
Existing Solution
Inverted List
Postgre
##P #Po Pos osg sgr gre re$ e$$
Posgre
Limitations
Inverted List Method
Limited queries supported
                Range Query  Join Query  Top-K Query  Top-K Join
Edit Distance   Y            Y           N            N
Normalized ED   N            N           N            N
Uncontrollable memory consumption
Concurrency protocol
Our Contributions
Bed-Tree
Wide support on different queries and distances
                Range Query  Join Query  Top-K Query  Top-K Join
Edit Distance   Y            Y           Y            Y
Normalized ED   Y            Y           Y            Y
Buffer size and low I/O cost
Highly concurrent
Easy to implement
Competitive performance
Basic Index Framework
Bed-Tree Framework
1. Map all strings to a 1D domain
2. Index construction follows standard B+ tree
3. Estimate the minimal distance to query and prune B+ tree nodes
4. Refine the result by exact edit distance
Query: Posgre
Result: Postgre
String Order Properties
P1: Comparability
Given two strings s1 and s2, we know the order of s1 and s2 under the specified string order
P2: Lower Bounding
Given an interval [L,U] on the string order, we know a lower bound on the edit distance to the query string
(Query: Posgre. Candidates in the sub-tree?)
P3: Pairwise Lower Bounding
Given two intervals [L,U] and [L’,U’], we know the lower bound of the edit distance between s1 from [L,U] and s2 from [L’,U’]
(Potential join results?)
P4: Length Bounding
Given an interval [L,U] on the string order, we know the minimal length of the strings in the interval
Properties vs. supported queries and distances
                Range Query  Join Query  Top-K Query  Top-K Join
Edit Distance   P1, P2       P1, P3      P1, P2       P1, P3
Normalized ED   P1, P2, P4   P1, P3, P4  P1, P2, P4   P1, P3, P4
Dictionary Order
All strings are ordered alphabetically, satisfying P1, P2 and P3
Insertion: Postgre
It’s between “pose” and “powder”
Search: Posgre with ED=1
Index entries shown: pose, powder, power, put, sit
Not pruning anything!
Pruning happens only when a long prefix exists
Gram Counting Order
Jim Gray
Hash all grams to 4 buckets
Count the grams in each bucket, in binary: 11, 10, 01, 11
Transform the count vector to a bit string with z-order
Order the strings with this signature
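A sketch of the encoding step for “Jim Gray”, assuming the z-order interleaves the bits of the 4 bucket counts from the most significant bit to the least significant bit (the interleaving order is an assumption here, chosen because it reproduces the bit string “11011011” that appears in the lower-bounding example for the query “Jim Gary”):
counts in binary: 11, 10, 01, 11
first bits:   1 1 0 1
second bits:  1 0 1 1
signature:    11011011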
Gram Counting Order
Query: Jim Gary, signature: (4,1,2,2)
Lower Bounding
Interval “11011011” to “11011101”
Prefix: “11011???”
Minimal edit distance: 1
Gram Location Order
Extension of Gram Counting Order
Include positional information of the grams
Jim Gray
Grace Hopper
Allow better estimation of mismatch grams
Harder to encode
Experiment Settings
Data
Five Index Schemes
Bed-Tree: BD, BGC, BGL
Inverted List: Flamingo, Mismatch
Default Setting
Q=2, Bucket=4, Page Size=4KB
Empirical Observations
How good is Bed-Tree?
With small threshold, Inverted Lists are better
When threshold increases, Bed-Tree is not worse
Which string order is better?
Gram counting order is generally better
Gram Location order: tradeoff between gram content information and position information
Conclusion
A new B+ tree index scheme
All similarity queries supported
Both edit distance and normalized distance
General transaction and concurrency protocol
Competitive efficiencies
References
• Benjarath Phoophakdee, Mohammed J. Zaki: "Genome-scale disk-based suffix tree indexing". SIGMOD Conference 2007: 833-844
• Chen Li, Bin Wang, and Xiaochun Yang. "VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams". In VLDB 2007.
• Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava, "B^{ed}-Tree: An All-Purpose Tree Index for String Similarity Search on Edit Distance". SIGMOD 2010.
Mining and Searching Complex Structures
Chapter 4 Similarity Search on Trees
Searching and Mining Complex Structures
Similarity Search on Trees
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Outline
Importance of Trees
Distance between Trees
Fast Edit Distance Approximation for Trees
Importance of Trees
Between sequences and graphs
Equivalent to a connected acyclic graph
Represents hierarchical structures
Examples
XML documents
Programs
RNA structure
3
Types of Trees
Is there a root?
Are the nodes labeled?
Are the children of a node ordered?
Outline
Importance of Trees
Distance between Trees
Fast Edit Distance Approximation for Trees
5
Distance Measure
Many ways to define distance
Convert to standard types and adopt the distance metric there
How many operations to transform one tree to another? (Edit
distance)
Inverse of similarity
dist(S, T) = maxSim – sim(S,T)
Relationship between different definitions?
Operations on Trees
Relabel
Delete
Insert
7
Remarks on Edit Distance
Ordered trees are tractable
Approach based on dynamic programming
NP-hard for unordered trees
Approach is to impose restrictions so that DP can be used
Edit Script
Edit script(S, T): sequence of operations to transform S to T
Example
1. S=
2. Delete c
3. Insert c
Relabel f → a
Relabel e → d
9
Edit Distance Mapping
Edit distance mapping(S, T): alternative representation of edit
operations
relabel: v → w
delete: v → $
insert: $ → w
Mapping corresponding to the script
Edit Distance for Ordered Trees
Generalize the problem to forests. Let v and w be the roots of the rightmost trees of forests S and T, tree(v) the tree rooted at v, and S(v) the forest of subtrees below v (similarly T(w)).
C(φ, φ) = 0
C(S, φ) = C(S – v, φ) + cost(v → $)
C(φ, T) = C(φ, T – w) + cost($ → w)
C(S, T) = minimum of
1. C(S – v, T) + cost(v → $)   [deleting v]
2. C(S, T – w) + cost($ → w)   [inserting w]
3. C(S – tree(v), T – tree(w)) + C(S(v), T(w)) + cost(v → w)   [relabel v → w]
Illustration of Case 3
C(S – tree(v), T – tree(w)) + C(S(v), T(w)) + cost(v → w)   [relabel v → w]
[Figure: forest S drawn as S – tree(v), the root v and S(v) below it; forest T drawn as T – tree(w), the root w and T(w) below it]
Algorithm Complexity
Number of subproblems bounded by O(|S|^2 |T|^2)
Zhang and Shasha (1989) showed that the number of relevant subproblems is O(|S||T| min(SD, SL) min(TD, TL)) and space is O(|S||T|), where SD, SL (TD, TL) are the depth and the number of leaves of S (T)
Further improvements require decomposing a rooted tree into disjoint paths
Decomposition into Paths
Concept of heavy and light nodes/edges (Harel and Tarjan, 1984)
Root is light, child with max size is heavy
Removal of light edges partitions T into disjoint heavy paths
Important property: light depth(v) ≤ log|T| + O(1)
Complexity can be reduced to O(|S|^2 |T| log|T|)
Unordered Edit Distance
NP-hard
Special cases (in P)
T is a sequence
Number of leaves in T is logarithmic
Disjoint subtrees map to disjoint subtrees
15
Tree Inclusion
Is there a sequence of deletion operations on S which can
transform it to T?
Special case of edit distance which only allows deletions
Complexity of Tree Inclusion
Ordered trees
Concept of embeddings (restriction of mappings)
O(|S||T|) using the algorithm of Kilpeläinen and Mannila
Unordered trees
NP-complete (what did you expect ?)
Special cases
17
Related Problems on Trees
Tree Alignment (covered in the survey paper)
Robinson-Foulds distance for leaf-labeled trees, where each edge corresponds to a bipartition of the leaves
Tree Pattern Matching
Maximum Agreement Subtree
Largest Common Subtree
Smallest Common Supertree
Many are generalizations of problems on strings
Summary of Tree Distance
Edit distance
Concept of edit mapping
Dynamic programming for ordered trees
Constrained edit distance for unordered trees
Tree inclusion
Special case of edit distance
Specialized algorithms are more efficient
Useful for determining embedded trees
19
Outline
Importance of Trees
Distance between Trees
Fast Edit Distance Approximation for Trees
Similarity Measurement
Edit Distance EDist(T1, T2)
Edit operation e, with cost γ(e): relabel a->b, delete b->λ, insert λ->b
Edit script si = (ei1, ei2, …, eik): T1->T2; cost(si) = ∑j γ(eij)
EDist(T1,T2) = mini(cost(si)); with unit cost: EDist(T1,T2) = min(k)
Computational Complexity:
O(|T1| × |T2| × min(depth(T1), leaves(T1)) × min(depth(T2), leaves(T2)))
Edit Operation Mapping
Edit operations mapping M(T1,T2)
One-to-one
Preserve sibling order
Preserve ancestor order
[Figure: mapping M(T1,T2) between the nodes of T1 and T2]
Observation
Edit operations do not change many sibling relationships
[Figure: a tree rooted at a, before and after deleting node c (c->λ)]
Sibling relation: (b,c)->(b,f), (c,d)->(i,d)
Node: varying number of children vs. at most 2 siblings
Binary Tree Representation
Left-child, right-sibling
Normalized Binary Tree (missing children padded with ε)
[Figure: T1 and T2, their normalized binary trees with numbered nodes such as a(1,8), b(2,3), c(3,1), and their binary branch vectors]
BBDist(T1, T2) = ∑ i=1..|Γ| |b1i − b2i| = 8
Triangular Inequality
One Edit Operation Effect
[Figure: a node v under parent v’, among nodes w1 … wl+m+1, before and after one edit operation, with the corresponding binary trees]
Each node appears in at most two binary branches
Theorem
1 insertion/deletion incurs at most 5 difference on BBDist
1 relabeling incurs at most 4 difference on BBDist
For T, T’ with EDist(T, T’) = k = ki + kd + kr:
BBDist(T,T’) <= 4kr + 5ki + 5kd <= 5k
1/5 BBDist is a lower bound of edit distance
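For instance, taking the value BBDist(T1, T2) = 8 from the binary tree representation example, and assuming unit-cost operations (so that the edit distance is an integer):
EDist(T1, T2) >= BBDist(T1, T2) / 5 = 8 / 5 = 1.6, hence EDist(T1, T2) >= 2
A data tree whose bound already exceeds the query range can therefore be discarded without computing its exact edit distance.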
Positional Binary Branch
[Figure: trees T1, T2 and T’2 with the normalized binary trees B(T1) and B(T2); nodes carry positions such as a(1,8), b(2,3), e(8,7)]
Incurs 0 difference for BBDist(T1,T2)
Positional binary branch: PosBiB(T(u))
PosBiB(T1(e)) = (BiB(e,ε,ε),8,7) ≠ PosBiB(T2(e)) = (BiB(e,ε,ε),6,3)
Positional Binary Branch Distance
Computational Complexity
D: dataset; |D|: dataset size
Vector construction part:
Traverse the data trees once
time, space: O(∑ i=1..|D| |Ti|)
Optimistic bound computation:
time: each binary search O(|Ti|+|Tq|),
totally: O(∑ i=1..|D| (|Ti| + |Tq|) log(min(|Ti|,|Tq|)))
space: O(∑ i=1..|D| (|Ti| + |Tq|))
Generalized Study
Extend the sliding window to q level
The image vector gives multiple-level binary branch profiles.
BDist_q(T,T’) <= [4*(q-1)+1]*EDist(T,T’)
[Figure: a node v under parent v’, among nodes w1 … wl+m+1, with the corresponding q-level binary branches]
Query Processing Strategy
Filter-and-refine frameworks
Lower bound distances filter out most objects
The lower bound computation is much cheaper than the real distance
Lower bound distance is a close approximation of the real distance
Remaining objects are validated by the real distance
Experimental Settings
Compare with histogram methods [KKSS04]
Lower bound: feature vector distance (Leaf Distance Height histogram vector, Degree histogram vector, Label histogram vector)
Synthetic dataset:
Tree size, Fanout, Label, Decay factor
Real dataset: dblp XML document
Performance measure:
Percentage of data accessed: (|false positive| + |true positive|) / |dataset| × 100%
CPU time consumed
Space requirement
Sensitivity to the Data Properties
Sensitivity test
[Figure: four panels; y-axes % of Accessed Data and CPU Cost (Second); legend BiBranch %, Histo %, Result %, BiBranch, Sequ]
Range: N{}N{50,2.0}L8D0.05 and KNN: N{}N{50,2.0}L8D0.05, x-axis Fanout; mean(fanout): 2 → 8; mean(|T|): 50; size(label): 8
Range: N{4,0.5}N{}L8D0.05 and KNN: N{4,0.5}N{}L8D0.05, x-axis Tree Size; mean(|T|): 25 → 125; mean(fanout): 4; size(label): 8
Sensitivity test (cont.)
[Figure: two panels, Range: N{4,0.5}N{50,2.0}L{}D0.05 and KNN: N{4,0.5}N{50,2.0}L{}D0.05; x-axis Label Number (8 to 64); y-axes % of Accessed Data and CPU Cost (Second); legend BiBranch %, Histo %, Result %, BiBranch, Sequ]
size(label): 8 → 64; mean(|T|): 50; mean(fanout): 4
Queries with Different Parameters
Dblp data (avg. distance: 5.031)
Range queries
KNN (k: 5-20)
[Figure: two panels, Range: DBLP (x-axis Range) and KNN: DBLP (x-axis k); y-axes % of Accessed Data and CPU Cost (second); legend BiBranch %, Histo %, Result %, BiBranch, Sequ]
Pruning Power of Different Level
Data distribution according to distances
Edit distance
Histogram distance
Binary branch distance: 2, 3, 4 level
[Figure: DBLP; x-axis Distance (1 to 12); y-axis Data Distribution (0 to 2000); legend Edit, Histo, BiBranch(2), BiBranch(3), BiBranch(4)]
Citations on the Paper
Surprisingly, the paper attracted citations and questions from software engineering! Expect more impact along the software mining direction soon.
DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones
L Jiang, G Misherghi, Z Su, S Glondu - Proceedings of the 29th International Conference on Software …, 2007
Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against minor code modifications. In this paper, we present an efficient ...
Fast Approximate Matching of Programs for Protecting Libre/Open Source Software by Using Spatial …
AJM Molina, T Shinohara - Source Code Analysis and Manipulation, 2007. SCAM 2007. …, 2007
To encourage open source/libre software development, it is desirable to have tools that can help to identify open source license violations. This paper describes the implementation of a tool that matches open source programs ...
References
• Philip Bille. A survey on tree edit distance and related problems. Theoretical Computer Science, Volume 337, Issue 1-3 (June 2005)
• Rui Yang, Panos Kalnis, Anthony K. H. Tung: Similarity Evaluation on Tree-structured Data. SIGMOD 2005.
• Optional References:
• JP Vert. "A tree kernel to analyze phylogenetic profiles". Bioinformatics, 2002, Oxford Univ Press
Mining and Searching Complex Structures
Chapter 5 Graph Similarity Search
Searching and Mining Complex
Structures
Graph Similarity Search
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Outline
• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search
175
Mining and Searching Complex Structures
Chapter 5 Graph Similarity Search
Smart Graphs
Chemical compound, Protein structure, Program flow, Coil, Image, Fingerprint, Letter, Shape
Motivation
• Why graph?
•Graph is ubiquitous
•Graph is a general model
•Graph has diversity
•Graph problem is complex and challenging
• Why graph search?
•Manifold application areas
• 2D and 3D image analysis
• Video analysis
• Document processing
• Biological and biomedical applications
Graph Search
• Definition
•Given a graph database D and a query graph Q, find all graphs in D supporting the users’ requirements:
• The same as Q
• Containing Q or contained by Q
• Similar to Q
• Similar to a subgraph of Q
• Challenge
•How to efficiently compare two graphs?
•How to reduce the number of pairwise graph comparisons?
How to efficiently compare two graphs?
• The graph matching problem
•Graph matching is the process of finding a
correspondence between the vertices and the edges of two
graphs that satisfies some (more or less stringent)
constraints ensuring that similar substructures in one graph
are mapped to similar substructures in the other.
177
Mining and Searching Complex Structures
Chapter 5 Graph Similarity Search
How to reduce the number of pairwise
graph comparisons?
• Scalability issue
•A full database scan
•Complex graph matching between a pair of graphs
• Index mechanisms are needed
Outline
• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search
178
Mining and Searching Complex Structures
Chapter 5 Graph Similarity Search
Categories of Matching
Exact Graph Matching
• Graph Isomorphism
•Two graphs G1=(V1,E1) and G2=(V2,E2) are isomorphic if
there is a bijective function f: V1 → V2 such that for all
u, v ∈ V1: {u,v} ∈ E1 ↔ {f(u),f(v)} ∈ E2
G1
G2
179
Mining and Searching Complex Structures
Chapter 5 Graph Similarity Search
Exact Graph Matching
• Induced Subgraph
•A subset of the vertices of a graph together with all edges
whose endpoints are both in this subset
• Subgraph Isomorphism
•An isomorphism holds between one of the two graphs and
an induced subgraph of the other
Graph Similarity Measure
• Graph Edit Distance
•The minimum amount of distortion that is needed to
transform one graph into another
•The edit operations ei can be deletions, insertions, and
substitutions of vertices and edges
G1
G2
Graph Similarity Measure
• Graph Edit Distance (GED)
•Given two attributed graphs G1 = (V1,E1, Σ, l1) and G2 = (V2,E2, Σ, l2), the GED between them is defined as
GED(G1,G2) = min over (e1, …, ek) ∈ T(G1,G2) of ∑ i=1..k c(ei)
•where T(G1,G2) denotes the set of edit paths transforming G1 into G2, and c denotes the edit cost function measuring the strength c(ei) of edit operation ei
• GED provides a general dissimilarity measure for graphs
• Most works on inexact graph matching focus on the GED computation problem
Outline
• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search
Exact Matching Algorithms
• Tree search based algorithms
•Ullmann’s algorithm
•VF and VF2 algorithm
• Other algorithms
•Nauty algorithm
184
Mining and Searching Complex Structures
Chapter 5 Graph Similarity Search
Tree Search based Algorithms
• Basic Idea
•A partial match (initially empty) is iteratively expanded by
adding new pairs of matched vertices
•The pair is selected using some necessary conditions,
usually also some heuristic condition to prune unfruitful
search paths
•The algorithm ends when it finds a complete matching, or no
further vertex pairs may be added (backtracking)
•For attributed graphs, the attributes of vertices and edges can
be used to constrain the desired matching
The Backtracking Algorithm
• Depth-First Search (DFS):
•progresses by expanding the first child node of the search tree
•going deeper and deeper until a goal node is found, or until it hits a node that has no children.
• Branch and Bound (B&B):
•BFS (breadth-first search)-like search for the optimal solution
•Branching splits a set of solution candidates into two or more smaller sets
•Bounding is a procedure that computes upper and lower bounds
[Figure: search trees with nodes numbered 1 to 8 in visiting order]
Tree Search based Algorithms
• Ullmann’s Algorithm (DFS)
•A refinement procedure based on a matrix of possible future matched vertex pairs to prune unfruitful matches
•The simple enumeration algorithm for the isomorphisms between a graph G1 and a subgraph of another graph G2 with adjacency matrices A1 and A2
• An M’ matrix with |V1| rows and |V2| columns can be used to permute the rows and columns of A2 to produce a further matrix P = M’(M’A2)^T. If (a1i,j = 1) ⇒ (pi,j = 1) for all i, j, then M’ specifies an isomorphism between G1 and the subgraph of G2.
Tree Search based Algorithms
• Ullmann’s Algorithm
•Example for permutation matrix
•The elements of M’ are 1’s and 0’s, such that each row contains exactly one 1 and no column contains more than one 1
Matrices are written row by row, with rows separated by “;”:
P = M’(M’A2)^T
  = [1 0 0 0; 0 0 1 0; 0 1 0 0] · ([1 0 0 0; 0 0 1 0; 0 1 0 0] · [0 1 0 0; 1 0 1 1; 0 1 0 0; 0 1 0 0])^T
  = [1 0 0 0; 0 0 1 0; 0 1 0 0] · [0 0 1; 1 1 0; 0 0 1; 0 0 1]
  = [0 0 1; 0 0 1; 1 1 0]
Tree Search based Algorithms
• Ullmann’s Algorithm
•Construction of another matrix M0 with the same size as M’:
m0i,j = 1 if deg(V2j) ≥ deg(V1i), 0 otherwise; m0i,j ∈ {0,1}
•Generation of all M’ by setting all but one of the 1’s in each row of M0 to 0
•A subgraph isomorphism has been found if (a1i,j = 1) ⇒ (pi,j = 1)
Matrices are written row by row, with rows separated by “;”:
A2 = [0 1 0 0; 1 0 1 1; 0 1 0 0; 0 1 0 0]   (graph G2)
A1 = [0 0 1; 0 0 1; 1 1 0]   (graph G1)
M0 = [1 1 1 1; 1 1 1 1; 0 1 0 0]
Tree Search based Algorithms
• Ullmann’s Algorithm
•An example
Search tree over M’ (rows separated by “;”):
Root: M0 = [1 1 1 1; 1 1 1 1; 0 1 0 0]
Fixing row 1:
[1 0 0 0; 1 1 1 1; 0 1 0 0] with leaves [1 0 0 0; 0 0 1 0; 0 1 0 0] and [1 0 0 0; 0 0 0 1; 0 1 0 0]
[0 1 0 0; 1 1 1 1; 0 1 0 0] with no leaf (column 2 is the only candidate for row 3)
[0 0 1 0; 1 1 1 1; 0 1 0 0] with leaves [0 0 1 0; 1 0 0 0; 0 1 0 0] and [0 0 1 0; 0 0 0 1; 0 1 0 0]
[0 0 0 1; 1 1 1 1; 0 1 0 0] with leaves [0 0 0 1; 1 0 0 0; 0 1 0 0] and [0 0 0 1; 0 0 1 0; 0 1 0 0]
At a leaf, P = M’(M’A2)^T = [0 0 1; 0 0 1; 1 1 0] is compared with A1 = [0 0 1; 0 0 1; 1 1 0]
[Figure: the graphs corresponding to each node of the search tree]
Tree Search based Algorithms
• Ullmann’s Algorithm
•One of the most widely used algorithms
• VF or VF2
•VF defines a heuristic based on the analysis of vertices
•VF2 reduces the memory requirement from O(n^2) to O(n)
• Other methods: Nauty Algorithm
•Constructs the automorphism group of each of the input graphs and derives a canonical labeling. The isomorphism can be checked by verifying the equality of the adjacency matrices of the canonically labeled graphs
Exact Graph Matching
• Summary
•The matching problems are all NP-complete except for graph isomorphism, which is in NP but has not been shown to be either in P or NP-complete.
•Exact isomorphism is very seldom used. Subgraph isomorphism can be effectively used in many contexts.
•Exact graph matching has exponential time complexity in the worst case.
•Ullmann’s algorithm, the VF2 algorithm and the Nauty algorithm are the most widely used algorithms. Most modified algorithms adopt some conditions to prune unfruitful partial matchings.
Error-Tolerant Graph Matching
• GED Computation
•Optimal algorithms
• Exact GED computation requires isomorphism testing
• Tree search based algorithms (A* based algorithms)
•Suboptimal algorithms
• Heuristic algorithms
• Formulated as a binary linear programming (BLP) problem
A* Algorithm
• A tree search based algorithm
•Similar to isomorphism testing
•Differently, the vertices of the source graph can potentially be mapped to any vertex of the target graph
•Search tree is constructed dynamically, linking new nodes by edges to the currently considered vertex
•A heuristic function is usually used to determine the vertex for expansion
Exact GED Computation
• Summary
•The complexity is exponential in the number of vertices of
the involved graphs.
•For graphs with unique vertex labels the complexity is linear.
•Exact graph edit distance is feasible for small graphs only.
•Several suboptimal methods have been proposed to speed up
the computation and make GED applicable to large graphs.
Bipartite Matching for GED
• A Heuristic Algorithm
•A new suboptimal procedure for the GED computation
based on Hungarian algorithm (i.e., Munkres’ Algorithm).
•Hungarian algorithm is used as a tree search heuristic.
•Much faster than the exact computation and the other
suboptimal methods
•Applicable to larger graphs
Bipartite Matching for GED
• Assignment Problem
•Find an optimal assignment of n elements in a set S1 =
{u1, …, un} to n elements in a set S2 = {v1, …, vn}
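•Let cij be the cost of the assignment (ui → vj)
•The optimal assignment is a permutation P = (p1, …, pn) of
the integers 1, …, n that minimizes the total cost
    c(1,p1) + c(2,p2) + … + c(n,pn)
[Figure: bipartite graph between S1 and S2 with edge costs c11, c12, c13, …]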
Bipartite Matching for GED
• Assignment Problem
•Given the n × n matrix Mcij of the assignment costs
•This problem can be formulated as finding a set of n
independent elements of Mcij with minimum summation
[Figure: example cost matrix and the bipartite graph between S1 and S2]
•Hungarian algorithm finds the minimum cost assignment in
O(n^3) time.
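The assignment problem above can be sketched in code. The following is a minimal pure-Python version of the O(n^3) Hungarian (Kuhn-Munkres) method in its common potentials formulation; the function name `hungarian` and the 3 × 3 cost matrix are illustrative assumptions, not taken from the slides.

```python
def hungarian(cost):
    """Minimum-cost assignment for a square cost matrix.

    Returns (total_cost, assignment) where assignment[i] is the column
    chosen for row i. Potentials-based O(n^3) formulation; the name and
    interface are illustrative, not from the lecture notes.
    """
    n = len(cost)
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials (1-based)
    v = [0] * (n + 1)      # column potentials (1-based)
    p = [0] * (n + 1)      # p[j] = row currently matched to column j
    way = [0] * (n + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = INF
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return total, assignment


# Hypothetical 3 x 3 cost matrix (not the one in the slide figure).
example_cost = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]
print(hungarian(example_cost))
```

For small n, checking the result against all n! permutations is an easy sanity test.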
Bipartite Matching for GED
• Main Idea
•Construct a vertex cost matrix Mcv and an edge cost matrix
Mce
•For each open vertex v in the search tree, run Hungarian
algorithm on Mcv and Mce
•The accumulated minimum cost of both assignments serves
as a lower bound for the future costs to reach a leaf node
•h(P) = Hungarian(Mcv) + Hungarian(Mce) is the tree search
heuristic
•Returns a suboptimal solution as an upper bound of GED
Suboptimal Algorithms
• Binary Linear Programming (BLP)
•Use the adjacency matrix representation to formulate a BLP
•Compute GED between G0 and G1
•Edit grid
Binary Linear Programming
• Isomorphisms of G0 on the edit grid
• State vectors
Binary Linear Programming
• Definition:
• Objective Function:
Binary Linear Programming
• Lower Bound: linear program (O(n^7))
• Upper Bound: assignment problem (O(n^3))
Summary
• The complexity of the exact GED computation is
exponential, which is unacceptable for large graphs.
• Suboptimal methods solve the graph matching problem
by quickly returning a suboptimal solution and can be
applied to larger graphs.
• An important application of the graph matching problem
is searching a graph database.
Outline
• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search
Graph Search Problem
• Query a graph database
•Given a graph database D and a query graph Q, find all
graphs in D supporting the users’ requirements.
•Full graph search (all match)
•Subgraph search (partial match or containment search)
•Similarity full graph search (based on GED)
•Similarity subgraph search (based on GED)
Scalability Issue
• On-line searching algorithm
[Figure: the query is checked against all 100,000 database graphs by subgraph isomorphism testing]
• A full sequential scan
•I/O costs
•Subgraph isomorphism testing (GED computation)
• An indexing mechanism is needed
Indexing Graphs
• Indexing is crucial
[Figure: without an index, all 100,000 graphs must be checked; with an index, filtering reduces the 100,000 graphs to 100 candidates for checking]
Indexing Strategy
• Filter-and-refine framework based on features
Step 1. Index Construction
Enumerate smaller units (features)
in the database, build an index
between units and graphs
Step 2. Query Processing
Enumerate smaller units in the
query graph
Use the index to first filter out
non-candidates, then check the remaining candidates
Indexing Strategy
• Feature-based Indexing methods
•Break the database graphs into smaller units like paths, trees,
and subgraphs, and use them as filtering features
•Build inverted index between the smaller units and the
database graphs
•Filter graphs based on the number of smaller units or their
locality information
Feature-based Indexing Systems
System      Smaller units   Query
GraphGrep   path            Containment search
SING        path            Containment search
gIndex      graph           Containment search + edge relaxation
FGIndex     graph           Containment search
TREE∆       tree+graph      Containment search
Treepi      tree            Containment search
κ-AT        tree            Full similarity search
CTree       -               Containment search + edge relaxation
Path-based Algorithms
[Figures omitted; source: http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]
Path-based Algorithms: problem
[Figure omitted; source: http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]
Feature-based Methods: limitation
• Problem:
•For similarity search, filtering is done by inferring the edit
distance bound through the smaller units that exactly match
the query structure
• A rough bound
• Not effective for large graphs (because features that may be
rare in small graphs are likely to be found in enormous graphs
just by chance)
Graph Similarity Search Problem
• Definition
•Given a graph database D and a query structure Q, similarity search is
to find all the graphs in D that are similar to Q based on GED.
• Two challenges in the filter-and-refine framework:
•How to efficiently compute more effective edit distance
bounds between two graphs for filtering?
•How to reduce the number of pairwise graph dissimilarity
computations to speed up the graph search?
Our Solutions
• Work 1: Star decomposition
•Break each graph into a multiset of stars
•Propose new effective and efficient lower and upper GED
bounds through finding a mapping between the star sets of
two graphs using Hungarian algorithm
• Work 2: Sorted index for graph similarity search
•Propose a novel indexing and query processing framework
•Deploy a filtering strategy based on TA and CA methods to
reduce the number of pairwise graph dissimilarity
computations
Comparing Stars: On
Approximating Graph Edit
Distance
Zhiping Zeng
Anthony K.H. Tung
Jianyong Wang
Jianhua Feng
Lizhu Zhou
Star Decomposition
• Star structure
•A star structure s is an attributed, single-level, rooted tree
which can be represented by a 3-tuple s=(r, L, l), where r is
the root vertex, L is the set of leaves and l is a labeling
function.
• Star representation for graph
•A graph can be broken into a multiset of star structures
[Figure: graph G1 and its star representation]
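The star representation can be sketched as follows: every vertex becomes the root of one star whose leaves are its neighbours. The tuple encoding `(root_label, sorted_leaf_labels)` and the small example graph are assumptions made for illustration; they are not the graph G1 from the figure.

```python
def star_decomposition(labels, edges):
    """Break a labelled undirected graph into a multiset of stars.

    labels: dict mapping vertex id -> label
    edges:  iterable of undirected (u, v) pairs
    Each star is encoded as (root_label, tuple_of_sorted_leaf_labels);
    this encoding is an illustrative choice, not from the slides.
    """
    neighbours = {v: [] for v in labels}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    stars = []
    for v in labels:
        leaves = tuple(sorted(labels[w] for w in neighbours[v]))
        stars.append((labels[v], leaves))
    return sorted(stars)  # a sorted list stands in for the multiset


# Hypothetical graph: vertex 1 (label a) is adjacent to 2 (b), 3 (c), 4 (c);
# vertices 3 and 4 are also adjacent to each other.
labels = {1: "a", 2: "b", 3: "c", 4: "c"}
edges = [(1, 2), (1, 3), (1, 4), (3, 4)]
for star in star_decomposition(labels, edges):
    print(star)
```

A graph with n vertices always yields exactly n stars, which is what makes the later bijection between star multisets possible once the smaller graph is padded.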
Star Decomposition
• Star edit distance
•Given two star structures s1 and s2,
    λ(s1, s2) = T(r1, r2) + d(L1, L2)
•where T(r1, r2) = 0 if l(r1) = l(r2); otherwise T(r1, r2) = 1
    d(L1, L2) = ||L1| − |L2|| + M(L1, L2)
    M(L1, L2) = max{|ΨL1|, |ΨL2|} − |ΨL1 ∩ ΨL2|
•ΨL denotes the multiset of leaf labels in L
Example: given s1 = abcc and s2 = dcc,
T(r1, r2) = 1, as l(a) ≠ l(d);
d(L1, L2) = |3 − 2| + 3 − 2 = 2;
λ(s1, s2) = 1 + 2 = 3.
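The star edit distance can be checked with a few lines of code. This is a minimal sketch that encodes a star as a pair `(root_label, leaf_labels)` (an assumed encoding) and uses `collections.Counter` for the multiset intersection of leaf labels.

```python
from collections import Counter


def star_edit_distance(s1, s2):
    """λ(s1, s2) = T(r1, r2) + d(L1, L2) for stars given as (root, leaves)."""
    (r1, leaves1), (r2, leaves2) = s1, s2
    # T(r1, r2): 0 if the root labels agree, otherwise 1.
    t = 0 if r1 == r2 else 1
    # |ΨL1 ∩ ΨL2|: size of the multiset intersection of the leaf labels.
    common = sum((Counter(leaves1) & Counter(leaves2)).values())
    # M(L1, L2) = max{|ΨL1|, |ΨL2|} − |ΨL1 ∩ ΨL2|
    m = max(len(leaves1), len(leaves2)) - common
    # d(L1, L2) = ||L1| − |L2|| + M(L1, L2)
    d = abs(len(leaves1) - len(leaves2)) + m
    return t + d


# The example from the slide: s1 = abcc (root a), s2 = dcc (root d).
print(star_edit_distance(("a", "bcc"), ("d", "cc")))
```

Leaf labels are passed as strings of single-character labels here only for brevity; any iterable of hashable labels works the same way.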
[Figure: star s1 with root a and leaves b, c, c; star s2 with root d and leaves c, c]
Star Decomposition
• Mapping distance
•Given two multisets of star structures S(G1) and S(G2) from
two graphs G1 and G2 with the same cardinality, and assume
P: S(G1) → S(G2) is a bijection. The mapping distance
between G1 and G2 is
    μ(G1, G2) = min over P of the sum of λ(s, P(s)) for all s in S(G1)
•This problem can be formulated as the assignment problem.
Given a distance cost matrix between two star multisets, the
mapping distance can be computed using the Hungarian
algorithm.
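To make the definition concrete, the sketch below computes the mapping distance by trying every bijection between the two star multisets. This brute-force search is a stand-in for the Hungarian algorithm (it is only usable for tiny inputs); the star encoding `(root_label, leaf_labels)` and the two path graphs are assumptions for illustration.

```python
from collections import Counter
from itertools import permutations


def star_edit_distance(s1, s2):
    """λ(s1, s2) for stars given as (root_label, leaf_labels)."""
    (r1, leaves1), (r2, leaves2) = s1, s2
    t = 0 if r1 == r2 else 1
    common = sum((Counter(leaves1) & Counter(leaves2)).values())
    m = max(len(leaves1), len(leaves2)) - common
    return t + abs(len(leaves1) - len(leaves2)) + m


def mapping_distance(stars1, stars2):
    """μ: minimum total star edit distance over all bijections.

    Brute force over permutations, O(n!) time; the slides use the
    Hungarian algorithm on the same cost matrix instead.
    """
    if len(stars1) != len(stars2):
        raise ValueError("star multisets must have the same cardinality")
    return min(
        sum(star_edit_distance(s, stars2[j]) for s, j in zip(stars1, perm))
        for perm in permutations(range(len(stars2)))
    )


# Hypothetical graphs: path a-b-c versus path a-b-d.
g1_stars = [("a", "b"), ("b", "ac"), ("c", "b")]
g2_stars = [("a", "b"), ("b", "ad"), ("d", "b")]
print(mapping_distance(g1_stars, g2_stars))
```

In this toy case a single vertex relabelling (c to d) turns one graph into the other, yet two stars are affected, which illustrates why μ has to be scaled down before it can serve as a lower bound on the edit distance.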
A Simple Example
Bounds of GED
• Lower Bound
•Let G1 and G2 be two graphs, then the mapping distance
μ(G1, G2) between them satisfies
    μ(G1, G2) ≤ max{4, min{δ(G1), δ(G2)} + 1} · λ(G1, G2)
where δ(G) is the maximum vertex degree of G
• Based on the above lemma, μ provides a lower bound
Lm of λ, i.e.,
    Lm(G1, G2) = μ(G1, G2) / max{4, min{δ(G1), δ(G2)} + 1} ≤ λ(G1, G2)
Constructing the cost matrix takes Θ(n^3), and running
the Hungarian algorithm takes O(n^3).
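As a small worked illustration of the lemma, the lower bound follows from dividing the mapping distance by the degree-dependent factor. The function name and the sample numbers below are assumptions chosen for the example.

```python
def ged_lower_bound(mu, max_degree_g1, max_degree_g2):
    """Lm = μ / max{4, min{δ(G1), δ(G2)} + 1}, a lower bound on GED λ."""
    factor = max(4, min(max_degree_g1, max_degree_g2) + 1)
    return mu / factor


# Two path graphs with maximum degree 2 and mapping distance 2:
# the factor is max{4, 3} = 4, so Lm = 0.5.
print(ged_lower_bound(2, 2, 2))
# Denser graphs: maximum degrees 5 and 7 give the factor max{4, 6} = 6.
print(ged_lower_bound(12, 5, 7))
```

The factor never drops below 4, so for sparse graphs the bound is simply μ/4.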
Bounds of GED
• Upper bound
•The first upper bound τ comes naturally during the
computation of μ
•The output from the computation of μ using the Hungarian
algorithm leads to a mapping P’ from V(G1) to V(G2)
•Recall from the BLP method that the exact GED is computed as
    λ(G1, G2) = min over P of C(G1, G2, P)
•Therefore,
    τ(G1, G2) = C(G1, G2, P’)
is a natural upper bound
The mapping P’ might not be optimal, so τ(G1, G2) ≥ λ(G1, G2).
C(G1, G2, P’) is solved in Θ(n^2) time, therefore, τ can be computed in
Θ(n^3) time.
Bounds of GED
• Refined upper bound ρ: main idea
•Given any two vertices v1 and v2 in G1 and their
corresponding mappings f(v1) and f(v2) in G2 (assuming f is
the mapping function corresponding to P’), we swap f(v1)
and f(v2) if this reduces the edit distance.
[Figure: graphs G1 and G2 (with an ε vertex) before and after swapping two mapped vertices]
The new mappings obtained might lead to better or worse
bounds. Refining to get a better bound takes O(n^6).
Filtering Strategy
• Integrating all the GED bounds into a filter-and-refine
framework
•Filtering features: Lm ≤ λ ≤ ρ ≤ τ.
•Filtering orders: bounds with lower computation complexity
are deployed first.
Full Graph Similarity Search
• Problem
•Given a graph database D and a query structure Q, find all
the graphs Gi in D with λ(Q, Gi) ≤ d (d is a threshold).
• AppFULL algorithm:
•if Lm(Q, Gi) > d, Gi can be safely filtered;
•if τ(Q, Gi) ≤ d, Gi can be reported as a result directly;
•if ρ(Q, Gi) ≤ d, Gi can be reported as a result directly;
•otherwise, λ(Q, Gi) must be computed.
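The filter-and-refine order of AppFULL can be sketched as a simple cascade. The bound functions are passed in as callables, and the toy "graphs" below are plain integers with made-up bounds; all of these names and values are assumptions used only to show the control flow.

```python
def app_full(query, database, d, lower_bound, upper_tau, upper_rho, exact_ged):
    """Return all graphs g in database with GED(query, g) <= d.

    Cheap bounds are tried first; exact_ged is called only when no bound
    decides. The callables stand in for Lm, τ, ρ and λ from the slides.
    """
    results = []
    for g in database:
        if lower_bound(query, g) > d:
            continue                      # safely filtered out
        if upper_tau(query, g) <= d:
            results.append(g)             # accepted without exact GED
            continue
        if upper_rho(query, g) <= d:
            results.append(g)             # accepted by the refined bound
            continue
        if exact_ged(query, g) <= d:      # undecided: compute λ exactly
            results.append(g)
    return results


# Toy stand-ins: the "distance" between integers is their absolute difference.
exact_calls = []

def toy_exact(q, g):
    exact_calls.append(g)
    return abs(q - g)

answers = app_full(
    0, [0, 1, 3, 5, 10], 3,
    lower_bound=lambda q, g: abs(q - g) // 2,
    upper_tau=lambda q, g: abs(q - g) * 2,
    upper_rho=lambda q, g: abs(q - g) + 1,
    exact_ged=toy_exact,
)
print(answers, exact_calls)
```

Ordering the tests from cheapest to most expensive mirrors the filtering order stated on the slide: bounds with lower computational complexity are deployed first.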
Exact Subgraph Search
• Lemma
•Given two graphs G1 and G2, if no vertex relabelling is
allowed in the edit operations, μ’(G1, G2) ≤ 4 · λ’(G1, G2),
where μ’ and λ’ are computed without vertex relabelling.
•(This lemma can be used in subgraph search, because if a
graph is subgraph-isomorphic to another graph, no vertex
relabelling happens.)
• AppSUB algorithm:
•Filtering based on the lower bound.
Experimental Results
• Compare with the exact algorithm
1,000 graphs were generated, D = 1k, T = 10, V = 4.
Randomly select 10 seed graphs to form D; a seed has 10 vertices.
6 query groups. Each group has 10 graphs. Graphs in the same
group have the same number of vertices.
• Compare with the BLP method
• Scalability over real datasets
• Scalability over synthetic datasets
• Performance of AppFULL
• Performance of AppSUB
[Result plots omitted]
SEGOS: SEarch similar
Graphs On Star index
Xiaoli Wang
Xiaofeng Ding
Anthony K.H. Tung
Shanshan Ying
Hai Jin
Our Solutions
• Work 1: Scalability issue
•A full database scan
•An index mechanism is needed
• Existing indexing methods: Filtering power
•Rough bounds with poor filtering power
• Work 2: Sorted index for graph similarity search
•Propose a novel indexing and query processing framework
•Deploy a filtering strategy based on TA and CA methods
•All existing lower and upper GED bounds can be directly
integrated into our filtering framework
TA Method on the Top-k Query
• The database model used in TA
Object ID   A1     A2
a           0.9    0.85
b           0.8    0.7
c           0.72   0.2
d           0.6    0.9
…
Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6)
Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2)
TA method on the top-k query
• A simple query
•Find the top-2 objects for the query ‘A1 & A2’
•This query results in the TA method combining the scores of
A1 and A2 by an aggregation function like
sum(A1,A2)
Aggregation function:
function that gives objects an overall score based on attribute
scores
examples: sum, min functions
Monotonicity: the aggregation function must be monotone, i.e., the
overall score must not decrease when any attribute score increases.
Monotony on TA (Halting Condition)
• Main idea
•How do we know that the scores of seen objects are higher than
those of unseen objects?
•Predict the maximum possible score of unseen objects:
[Figure: sorted lists L1 and L2 scanned to depth 3; (a, 0.9), (b, 0.8), (c, 0.72) in L1 and (d, 0.9), (a, 0.85), (b, 0.7) in L2 are seen, while deeper entries such as f are possibly unseen]
Threshold value: ω = sum(0.72, 0.7) = 1.42
A Top-2 Query Example
• Given 2 sorted lists for attributes A1 and A2,
L1: (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6)
L2: (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2)
• Step 1
•Parallel sorted access to attributes from every sorted list
•For each object seen:
• get all scores by random access
• determine sum(A1,A2)
• amongst 2 highest seen? keep in buffer
Buffer after the first round of sorted access (a from L1, d from L2):
ID   A1    A2    sum(A1,A2)
a    0.9   0.85  1.75
d    0.6   0.9   1.5
• Step 2
•Determine threshold value based on objects currently seen
under sorted access. ω = sum(L1, L2)
•2 objects with overall score ≥ threshold value ω? Stop
•else go to next entry position in sorted list and go to step 1
ω = sum(0.9, 0.9) = 1.8; neither a (1.75) nor d (1.5) reaches ω, so continue
• Step 1 (Again)
•Sorted access retrieves b from L1 and a from L2; random access gives
b: A1 = 0.8, A2 = 0.7, sum(A1,A2) = 1.5
•The buffer still holds a (1.75) and d (1.5)
• Step 2 (Again)
ω = sum(0.8, 0.85) = 1.65; only a (1.75) reaches ω, so continue
Situation at stopping:
ω = sum(0.72, 0.7) = 1.42 < 1.5
ID   A1    A2    sum(A1,A2)
a    0.9   0.85  1.75
d    0.6   0.9   1.5
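The walkthrough above can be reproduced with a short sketch of the threshold algorithm (TA). It assumes that every list has the same length and contains every object, that `sum` is the aggregation function, and that the elided middle of the slide's lists is empty, so only the four objects a, b, c, d exist; these simplifications and the function name are assumptions.

```python
import heapq


def threshold_algorithm(sorted_lists, k):
    """Top-k by sum of scores over lists sorted by descending score.

    sorted_lists: list of lists of (object_id, score) pairs.
    Returns (top_k, depth) where depth is the number of sorted-access
    rounds performed before the halting condition held.
    """
    tables = [dict(lst) for lst in sorted_lists]   # random-access lookup
    seen = {}                                      # object -> overall score
    rounds = len(sorted_lists[0])
    for depth in range(rounds):
        # Step 1: sorted access to each list, random access for the rest.
        for lst in sorted_lists:
            obj = lst[depth][0]
            if obj not in seen:
                seen[obj] = sum(table[obj] for table in tables)
        # Step 2: threshold from the scores last seen under sorted access.
        omega = sum(lst[depth][1] for lst in sorted_lists)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= omega:
            return top, depth + 1
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), rounds


L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
top2, depth = threshold_algorithm([L1, L2], k=2)
print(top2, depth)
```

With this data the scan stops after the third round, when the threshold has dropped to sum(0.72, 0.7). Objects b and d tie at 1.5, so which of them is reported second depends on tie-breaking.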
TA-based Filtering Strategy for Graph
Search Problem
• Main idea
•Each graph is broken into a multiset of stars
•Each distinct star generated from the database graphs can be
seen as an index attribute in the TA database model
•Each entry in the sorted lists contains the graph identity
(denoted by gi) and its score (denoted by λ) in that star
attribute; the score is defined as the star edit distance between
a star of gi and the index star
•Halting condition: given m sorted lists, if the aggregated
value ω = sum(λ1, …, λm) ≥ d (d is the threshold bound
on the graph mapping distance), TA stops.
TA-based Filtering Strategy for Graph
Search Problem
• Challenges:
•How do we know that the mapping distances of unseen graphs
exceed the distance threshold (so that these graphs can be safely
filtered out)? Predict the minimum possible mapping distance for
unseen graphs:
L1: g1: 0, g2: 1, g3: 2 | …, g6: 5, …, g4: 6
L2: g4: 0, g1: 1, g2: 3 | …, g6: 5, …, g3: 9
(entries before "|" are seen; the rest are possibly unseen)
Threshold value: ω = sum(2, 3) = 5 > d (= 4)
TA-based Filtering Strategy for Graph
Search Problem
• A graph database with a query example
[Figure: example graph database and query graph, with the resulting sorted lists L1, L2 and L3]
Requirement
• An index structure
•Convenient for score-sorted lists construction
• Efficient star search algorithm
•Quickly return similar stars to a query star
• Sorted properties for the halting condition of TA
•The mapping distance of any unseen graph gi satisfies
    μ(q, gi) ≥ ω = sum(λ1, …, λm) > d (= τ∗δ’)
•q is the query graph, τ is the distance threshold, and
    δ’ = max{4, min{δ(q), δ(D’)} + 1}
•where D’ is the set of all unseen graphs.
Recall that the mapping distance in our previous work satisfies
    μ(q, gi) ≤ max{4, min{δ(q), δ(gi)} + 1} · λ(q, gi)
We denote δ(q, gi) = max{4, min{δ(q), δ(gi)} + 1}; then δ(q, gi) ≤ δ’.
If μ(q, gi) > τ∗δ’, then λ(q, gi) ≥ μ(q, gi)/δ(q, gi) > τ∗δ’/δ(q, gi) ≥ τ,
and this graph can be safely filtered out.
Build Inverted Index Structures based
on the Star Decomposition
• The upper-level index
•Build an inverted index between stars and graphs
•Used to quickly return graph lists
• The lower-level index
•Build an inverted index between labels and stars
•Used to construct the sorted lists for top-k star search
based on the TA filtering strategy
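A minimal sketch of the upper-level index, mapping each distinct star to the graphs that contain it, might look as follows. The star encoding `(root_label, leaf_labels)`, the occurrence counts stored per graph, and the two toy graphs are assumptions made for illustration; the slides do not specify the entry layout.

```python
from collections import defaultdict


def build_star_index(graph_stars):
    """Inverted index from stars to graphs.

    graph_stars: dict mapping graph id -> list of stars in that graph.
    Returns a dict mapping each star to {graph_id: occurrence_count}.
    """
    index = defaultdict(dict)
    for gid, stars in graph_stars.items():
        for star in stars:
            index[star][gid] = index[star].get(gid, 0) + 1
    return dict(index)


# Hypothetical star multisets of two database graphs.
graph_stars = {
    "g1": [("a", ("b", "c")), ("b", ("a",)), ("c", ("a",))],
    "g2": [("a", ("b", "c")), ("c", ("a",)), ("c", ("a",))],
}
index = build_star_index(graph_stars)
print(index[("a", ("b", "c"))])
print(index[("c", ("a",))])
```

Looking up a star returns its graph list in a single dictionary access, which is the property the upper-level index is used for.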
[Figure: the two-level inverted index built from the star decomposition]
Top-k Star Search Algorithm
• Construct sorted lists
Graph Score-sorted Lists
• Construct lists based on the top-k results
TA-based Graph Range Query
• Definition
•Given a graph database D and a query q, find all gi ∈ D that
are similar to q with λ(q, gi) ≤ τ, where τ is the distance
threshold.
• Steps: given m sorted lists for a query graph q
•Step 1: perform sorted retrieval in a round-robin schedule on each
sorted list. For a retrieved graph gi, if Lm(q, gi) > τ, filter out
the graph; if Um(q, gi) ≤ τ, add the graph to the answer
set.
•Step 2: for each sorted list SLj, let χj be the corresponding distance
last seen under sorted access. If ω = sum(χ1, …, χm) >
τ∗δ’, then halt. Otherwise, go to step 1.
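The halting test of the range query can be sketched on its own. The code below scans lists sorted by ascending distance and stops once the sum of the last-seen distances exceeds the bound τ∗δ’; the per-graph Lm and Um checks from step 1 are left out, and the list contents reuse the small g1 to g6 illustration with its elided entries dropped. Function and variable names are assumptions.

```python
def ta_range_candidates(sorted_lists, bound):
    """Collect the graphs seen before the TA halting condition holds.

    sorted_lists: lists of (graph_id, distance) sorted by ascending distance,
                  all of the same length (a simplifying assumption).
    bound:        the value τ∗δ’ that ω must exceed for the scan to halt.
    Returns (seen_graph_ids, depth). Graphs never seen are pruned, since
    their mapping distance is at least ω.
    """
    seen = []
    rounds = len(sorted_lists[0])
    for depth in range(rounds):
        for lst in sorted_lists:
            gid = lst[depth][0]
            if gid not in seen:
                seen.append(gid)          # step 1 would apply Lm/Um here
        omega = sum(lst[depth][1] for lst in sorted_lists)
        if omega > bound:                 # step 2: halting condition
            return seen, depth + 1
    return seen, rounds


L1 = [("g1", 0), ("g2", 1), ("g3", 2), ("g6", 5), ("g4", 6)]
L2 = [("g4", 0), ("g1", 1), ("g2", 3), ("g6", 5), ("g3", 9)]
print(ta_range_candidates([L1, L2], bound=4))
```

With a bound of 4 the scan halts at depth 3, where ω = sum(2, 3) = 5, and g6 is never retrieved.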
CA-based Filtering Strategy
• The difference between TA and CA
•TA computes the mapping distance between two graphs
when retrieving a new graph through sorted accesses
•CA processes graphs only at every h-th depth of the sorted scan: for seen and
unprocessed graphs, CA first uses estimated mapping distance
bounds to filter graphs; then it uses the incremental
Hungarian algorithm to compute the partial mapping
distances for filtering
CA-based Filtering Strategy
• Suppose l(g) = {l1,…,ly} ⊆ {1,2,…,m} is a set of known
lists of g seen below q. Let χ(g) be the multiset of
distances of the distinct stars of g last seen in known lists.
•Lower bound denoted by Lμ(q, g) is obtained by substituting
the missing lists j ∈ {1,2,…,m}\l(g) with χj (the distance
last seen under the jth list) in ζ(q, g)
•Upper bound denoted by Uμ(q, g) is computed as
Uμ(q, g) = t′(χ(g)) + χ ∗ (|g| − |χ(g)|)
• Theorem: Let g1 and g2 be two graphs; the bounds
obtained as above satisfy
ζ(g1, g2) ≤ Lμ(g1, g2) ≤ μ(g1, g2) ≤ Uμ(g1, g2)
CA-based Filtering Strategy
• Dynamic Hungarian for partial mapping distance
•Given m sorted lists for q, suppose S′(g) ⊆ S(g) is a
multiset of stars in g seen below lists. Then we have μ(S(q),
S′(g)) ≤ μ(q, g)
CA-based Graph Range Query
• Steps: given m sorted lists for a query graph q
•Perform sorted retrieval in a round-robin schedule on each
sorted list. At each depth h of lists:
• Maintain the lowest values χ1, …, χm encountered in the
lists. Maintain a distance accumulator ζ(q, gi) and a multiset
of retrieved stars S′(gi) ⊆ S(gi) for each gi seen under lists.
• For each gi that is retrieved but unprocessed, if ζ(q, gi) > τ∗δgi,
filter it out; if Lμ(q, gi) > τ∗δgi, filter it out; if Uμ(q, gi) ≤ τ∗δgi,
add the graph to the candidate set. Otherwise, if μ(S(q), S′(gi))
> τ∗δgi, filter out the graph. Finally, run the Dynamic
Hungarian to obtain Lm(q, gi) and Um(q, gi) for filtering.
•When a new distance is updated, compute a new ω. If ω =
t′(χ) > τ∗δ′, then halt. Otherwise, go to step 1.
Experimental Results: Sensitivity test
Experimental Results: Index construction
Experimental Results: compare with other works varying distance thresholds
Experimental Results: compare with other works varying dataset sizes
[Result plots omitted]
References
• D. Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento.
Thirty Years of Graph Matching in Pattern Recognition.
• P. Foggia, C. Sansone and M. Vento. A performance
comparison of five algorithms for graph isomorphism. In 3rd
IAPR-TC15 workshop on graph-based representations in
pattern recognition, 2001.
• K. Riesen, M. Neuhaus, and H. Bunke. Bipartite graph
matching for computing the edit distance of graphs. In GBRPR,
2007.
• P. Hart, N. Nilsson, and B. Raphael. A formal basis for the
heuristic determination of minimum cost paths. IEEE Trans.
SSC, 1968.
• D. Justice. A binary linear programming formulation of the
graph edit distance. IEEE TPAMI, 2006.
• R. Giugno and D. Shasha. Graphgrep: A fast and universal
method for querying graphs. In ICPR, 2002.
• R. D. Natale, A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti,
and D. Shasha. SING: subgraph search in non-homogeneous
graphs. BMC Bioinformatics, 2010.
• X. Yan, P.S. Yu, and J. Han. Graph indexing: a frequent
structure-based approach. In SIGMOD, 2005.
• J. Cheng, Y. Ke, W. Ng, and A. Lu. Fg-index: towards
verification-free query processing on graph databases. In
SIGMOD, 2007.
• D.W. Williams, J. Huan, and W. Wang. Graph database
indexing using structured graph decomposition. In ICDE, 2007.
• S. Zhang, M. Hu, and J. Yang. Treepi: a novel graph indexing
method. In ICDE, 2007.
• P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta
>= graph. In VLDB, 2007.
• G. Wang, B. Wang, X. Yang, and G. Yu. Efficiently indexing
large sparse graphs for similarity search. IEEE TKDE, 2010.
Mining and Searching Complex Structures
Chapter 6 Massive Graph Mining
Searching and Mining Complex
Structures
Massive Graph Mining
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Graph applications: everywhere
And often, they are huge and messy.
[Figures: social network, bio pathway, co-authorship network]
Knowledge: NOWHERE
Unless we manage to find where they hide.
Too many clues is like no clue.
Part I (1.5 hrs)
•Graph Mining Primer
•Recent advances in Massive Graph Mining
Part 2(1.5 hrs)
•CSV: cohesive subgraph Mining
•DNgraph mining: a triangle based approach
• Graph Mining Primer
•Data mining vs. Graph mining
•Massive graph mining domain
•Types of graph patterns
•Properties of large graph structure
• Recent advances in Massive Graph Mining
• CSV: cohesive sub graph Mining
• DNgraph mining: a triangle based approach
From Data Mining to Graph Mining
Data Mining
• Classification
• Clustering
• Association rule learning
Graph Mining
• Captures more complicated
entity relationships.
• Output: patterns, which are
smaller subgraphs with
interpretable meanings.
Massive graph mining domains
•Financial data analysis
•Bioinformatics network
•User profiling for customized search
•Identify financial crime
Financial data analysis
In the stock market, correlations among stocks help in profit making.
Mining stock correlation graphs helps in predicting stocks' price
changes for estimating future returns, allocating portfolios,
controlling risks, etc.
[Figures: stock correlations in tabular form; stock correlation patterns with highly correlated stock sets]
Bioinformatics network
•Protein-protein interaction
• Fundamental to the activities of numerous living cells.
• A dense graph pattern indicates that these proteins
have similar functionalities.
[Figure: one representation of an assembled NEDD9 network]
User profiling for customized search
The Internet Movie Database (IMDB)
Registered users can comment on movies of their interest.
Mining the comment-sharing network provides insight into
users' interests and thus further facilitates customized search.
[Figure: movie-centric view of the IMDB review network]
Identify financial crime
Large classes of financial crimes such as money laundering,
[Figures: a money laundering pattern; geospatial information of suspects]
Dense Graph Patterns
Clique/Quasi-Clique
A clique represents the highest level of internal interactions.
Quasi-clique is an ``almost'' clique with few missing edges.
High Degree Patterns
Concern the average vertex degree, which is the number of
edges intercepting the vertex.
Dense graph patterns (cont.)
Dense Bipartite Patterns
Heavy Patterns
[Figure: weighted, directed graph of an online citation network, by Rosvall & Bergstrom]
[Figure: bipartite graph of pathways and genes for the AML/ALL dataset]
Properties of large graph structure
Static
•Power law degree distributions.
•Small world phenomenon.
•Communities and clusters.
Dynamic
•Shrinking diameters of enlarging graphs
•Densification along time
Power law
Large graph: properties and laws (cont.)
Dynamic
•Shrinking diameters of enlarging graphs.
•Densification along time
• Graph Mining Primer
• Large graph: properties and laws
• Approaches in Graph mining
• Pattern based Mining algorithms
• Practical techniques in Massive Graph Mining
• Graph summarization with randomized sampling
•Connectivity based traversal
•MapReduce based
• CSV: cohesive subgraph Mining
• Dngraph mining: a triangle based approach
Pattern based Mining algorithms
Greedy methods
SUBDUE (PWKDD04), GBI(JAI94)
Apriori-based approaches (detail in next few slides)
AGM , FSG, gSpan
Inductive logic programming (ILP) oriented solutions
WARMR, FARMAR
Kernel based solutions
Kernels for graph classification
manner
Use a lattice structure to count candidate subgraph sets efficiently.
A search lattice for item set mining
Apriori-based Graph Mining
Performance bottleneck: candidate subgraph generation.
Solution:
1. Build a lexicographic order among graphs.
2. Search using depth-first strategy.
Very effective in mining large collections of small to medium
size graphs.
Graph summarization with randomized sampling
• Efficient Aggregation for Graph Summarization - SIGMOD 2008
• Graph Summarization with Bounded Error - SIGMOD 2008
• Mining graph patterns efficiently via randomized summaries - VLDB 2009
Efficient Aggregation for Graph Summarization
As graph size increases, graph summarization becomes crucial for visualizing the whole graph.
Criteria for an efficient summarization solution
Able to produce meaningful summaries for real applications.
Scalable to large graphs.
The choice: graph aggregation
Graph Aggregation
1. Summarization based on user-selected node attributes and relationships.
2. Produce summaries with controllable resolutions.
"Drill-down" and "roll-up" abilities to navigate between resolutions
Proposes two aggregation operations
SNAP
k-SNAP
Operation SNAP
Group nodes by user-selected node attributes &amp; relationships
Nodes in each group are homogeneous (in terms of attributes and relationships).
Goal: minimum # of groups
How does SNAP work?
Top down approach
Initial Step: Use user-selected attributes to group nodes.
Iterative Step:
If a group is not homogeneous w.r.t. relationships, split the group based on its relationships with other groups.
SNAP limitation
Homogeneity requirement for relationships
Noise and uncertainty
Users have no control over the resolutions of summaries
SNAP operation can result in a large number of small groups
Operation k-SNAP
The entities inside a group are not necessarily homogeneous in terms of relationships with other groups.
Users can control the resolution by specifying k (# of groups).
Varying k provides "drill-down" and "roll-up" abilities.
Assess quality of summarization
Determined by the sum of noisy relations (Δ).
When the relationship between two groups is strong (> 50%), count missing participants.
When the relationship between two groups is weak (<= 50%), count extra participants.
k-SNAP goal
Find the summary of size k with the best quality, i.e. minimal Δ.
Finding the exact solution that minimizes Δ is NP-complete.
Heuristics
Top down: split a group into 2 at each iteration.
Choose the group with the worst quality and split it.
Bottom up: merge 2 groups into 1 at each iteration.
Choose groups with the same attribute values, similar neighbors, or similar participation ratios.
Major results
Double-blind review's effect on LP authors.
k-SNAP: Top down vs. Bottom up
Graph Summarization with Bounded Error
Large graph data needs compression
Compression can reduce size to 1/10 (web graph)
Graph compression vs. clustering
  Compression: uses URLs and node labels for compression; the result lacks meaning
  Clustering: works for generic graphs; gives no compression for space saving
Solution: MDL based compression for graphs
Intuition
Many nodes have similar neighborhoods
• Communities in social networks
Collapse such nodes into supernodes (clusters) and the edges into superedges
• A bipartite subgraph becomes two supernodes and a superedge
• A clique becomes a supernode with a "self-edge"
[Figure: example graph on nodes a-g and its summary with supernodes X = {d,e,f,g} and Y = {a,b,c}]
How to choose vertex sets to compress
MDL based compression
  S is a high-level summary graph
  C is a set of edge corrections
  Minimize the cost of S + C
Novel approximate representation: reconstructs the graph with bounded error (є); results in better compression
[Figure: original graph on nodes a-j, cost = 14 edges. Summary: supernodes X = {d,e,f,g} and Y = {a,b,c} plus nodes h, i, j. Corrections: +(a,h), +(c,i), +(c,j), -(a,d). Cost = 5 (1 superedge + 4 corrections)]
Compress (cont.)
Summary S(VS, ES)
  Each supernode v represents a set of nodes Av
  Each superedge (u,v) represents all pairs of edges πuv = Au × Av
Corrections C: {(a,b); a and b are nodes of G}
Supernodes are key; superedges and corrections are easy
  Euv: actual edges of G between Au and Av
  Cost with (u,v) = 1 + |πuv – Euv|
  Cost without (u,v) = |Euv|
  Choose the minimum; this decides whether superedge (u,v) is in S
[Figure: the original graph on nodes a-j next to its summary with X = {d,e,f,g}, Y = {a,b,c}, nodes h, i, j and C = {+(a,h), +(c,i), +(c,j), -(a,d)}]
Reconstruct
Reconstructing the graph from R
  For all superedges (u,v) in S, insert all pairs of edges πuv
  For all +ve corrections +(a,b), insert edge (a,b)
  For all -ve corrections -(a,b), delete edge (a,b)
Approximate representation
  Recreating the input graph exactly is not always necessary
  A reasonable approximation is enough to compute communities, anomalous traffic patterns, etc.
  Use the approximation leeway to get further cost reduction
Approximate Representation Rє
Generic Neighbor Query
  Given node v, find its neighbors Nv in G
  The approximate neighbor set N'v estimates Nv with є-accuracy
  Bounded error: error(v) = |N'v - Nv| + |Nv - N'v| < є |Nv|
  The number of neighbors added or deleted is at most an є-fraction of the true neighbors
Intuition for computing Rє
  If correction (a,d) is deleted, it adds error for both a and d
  From the exact representation R for G, remove the maximum number of corrections such that the є-error guarantees still hold
[Figure: summary with X = {d,e,f,g}, Y = {a,b} and C = {-(a,d), -(a,f)}; for є = .5, one correction of a can be removed]
Main Results: cost reduction
Reduces the cost down to 40%
Cost of GREEDY is 20% lower than RANDOMIZED
RANDOMIZED is 60% faster than GREEDY
Comparison with other schemes
The techniques give much better compression
Approximate-Representation
Cost reduces linearly as є is increased; with є = .1, 10% cost reduction over R
Mining graph patterns efficiently via randomized summaries
Motivation
In a graph with a large number of identically labeled vertices, graph isomorphism becomes a bottleneck.
How to avoid enumerating identical patterns?
3 (triangular) × 4 (square) = 12 (total)
Solution framework
Summarization -> Mining -> Verification
(Raw DB -> Summarized DB -> Raw DB)
Reduce false positives
• Technique 1: merge vertices that are far away from each other.
  • The length of the shortest path
  • The probability of a random walk
• Technique 2: merge vertices whose neighborhoods overlap.
  • Cosine, Chi^2, Lift, Coherence
• Technique 3: go back to the raw database to do verification.
  It is guaranteed that there are no false positives.
Summarization may cause false positives
False embeddings => false positives
[Figure: vertices labeled a and b merged by summarization, creating a false embedding]
Summarization: Reduce false negatives
Missed embeddings => false negatives
[Figure: embeddings on vertices labeled a, b and c that are missed after summarization]
Technique 1: For a raw database with frequency threshold min_sup, adopt a lower frequency threshold, pseudo min_sup, for the summarized database.
Technique 2: Iterate the mining steps T times and combine the results generated each time.
It is NOT guaranteed that there are no false negatives, but the possibility is bounded by
Connectivity based traversal
CSV: Cohesive Subgraph Mining –SIGMOD 2008
(Discussed in detail in Part II)
Progressive Clustering of Networks using
Structure-Connected Order of Traversal –ICDE 2010
Progressive clustering of networks using structure-connected order of traversal
SCAN Algorithm
•Similar to DBSCAN: connectivity-based
•Average O(n) time
•Uses structural similarity measure, minimum cluster size mu, and
minimum similarity epsilon
•Finds outliers and hubs
Problems
•No automated way to find good epsilon
•Must rerun algorithm for each possible epsilon
•Epsilon is global threshold
• No hierarchical clusters
• No variation in cluster subtlety
Solution
• Structure-Connected Order of Traversal (SCOT)
•Contains all possible epsilon-clusterings
• Efficient method to find global epsilon
• New Contiguous Subinterval Heap structure
(ContigHeap)
• New Progressive Mean Heap Clustering (ProClust)
•Epsilon-free
•Hierarchical
• Refinement by Gap Constraint (GapMerge)
[Figures: original network; SCOT plot]
Optimal Global Epsilon
The SCAN paper only contains a supervised sampling method.
Sample points, find k-NN similarities, sort, plot, find the knee visually
O(nd log n) time
Our solution:
The knee hypothesis implies an approximately concave plot
The optimal epsilon minimizes the obtuse angle between segments
Modified histogram and binary search: O(n) time
ContigHeap
BuildContigHeap produces a heap containing all contiguous subintervals from the SCOT output in O(n) time, and integrates with SCOT
[Figure: ContigHeap example]
GapMerge: Gap Constraint Refinement
Merges chained clusters, i.e. heap branches with single children
Does not merge across pruned heap nodes (local maxima boundary)
The gap constraint prevents clusters whose left or right boundaries differ by more than mu from being merged
Such clusters are not redundant relative to the minimum interesting cluster size
Steps
1. Identify chains that meet the gap constraint.
2. When a node has more than one child or violates the gap constraint, begin a new chain.
3. Within each chain, calculate the significance of each cluster in both the up and down directions.
4. Begin with the most redundant node; merge nodes in the direction of least significance.
5. After each merge, recalculate significances.
6. Continue until the chain contains one node, or no merging is possible under the gap constraint.
MapReduce based approach
PEGASUS: A Peta-Scale Graph Mining System - ICDM 2009
Pregel: A System for Large-Scale Graph Processing - SIGMOD 2010
PEGASUS: A Peta-Scale Graph Mining System
Deals with real graphs such as the Yahoo! Web graph, with up to 6.7 billion edges.
A Hadoop based graph mining package.
Targets primitive matrix operations such as matrix-vector multiplication (GIM-V).
Motivation
Many Graph mining tasks require matrix
multiplication
PageRank,
Random Walk with Restart(RWR),
Diameter estimation, and
Connected components …
MapReduce provides a simplified programming concept for large data processing
Details of data distribution, replication and load balancing are taken care of.
Provides a programming structure similar to functional programming.
GIM-V: Generalized Iterative Matrix-Vector multiplication
Intuition: matrix multiplication
  M × v = v', where v'_i = \sum_{j=1}^{n} m_{i,j} \times v_j
  combine2: multiply m_{i,j} and v_j
  combineAll: sum the n multiplication results for node i
  assign: overwrite the previous value of v_i with the new result to make v'_i
The operator ×G is matrix multiplication expressed by the above 3 steps
×G is carried out iteratively until convergence.
×G and SQL
The matrix multiplication operation ×G can be expressed by an SQL query.
If we view the graph as two tables:
an edge table E(sid, did, val) and
a vector table V(id, val),
then ×G becomes
SELECT E.sid, combineAll_{E.sid}(combine2(E.val, V.val))
FROM E, V
WHERE E.did = V.id
GROUP BY E.sid
Generalize ×G
Vary the definition of the three steps to generalize ×G
PageRank
  p = (c E^T + (1 - c) U) p
  E: row-normalized adjacency matrix
  U: matrix with all elements = 1/n
  c: damping factor = 0.85
  combine2 = c \times m_{i,j} \times v_j
  combineAll = \frac{1 - c}{n} + \sum_{j=1}^{n} x_j
Generalize (cont.)
By altering three functions, GIM-V adapts to
• Random Walk with Restart
• Diameter Estimation
• Connected Components
GIM-V: How to
Stage 1
combine2
V: key = id, value = vval; E: key = idsrc
Stage 2
combineAll & assign
Bottleneck: shuffling and disk I/O
GIM-V Block Multiplication (BL)
Save on sorting
Data compression
Clustered edges
[Figure: clustered edges]
GIM-V DI: Diagonal Block Iteration
Intuition
Increase the number of multiplications inside an iteration to reduce the # of iterations.
How
Reach local convergence within a block first, before iterating globally
Compare GIM-V BL and DI
Main Results
Scalability
GIM-V BL DI is ~5 times faster than GIM-V Base
Main Results (cont.)
The distribution of connected components is stable after a 'gelling' point in 2003.
Main Results (cont.)
Pregel: A System for Large-Scale
Graph Processing
A scalable and fault-tolerant platform with an API that is
sufficiently flexible to express arbitrary graph algorithms.
Model of computing:
Vertex centric, synchronized iterative model
Graph Algorithms Implementation in Pregel
Graph data stay on their respective machines; only messages are passed, NOT graph state.
Pregel C++ API
• Compute() - executed at each active vertex in every superstep.
  • Query information about the current vertex and its edges.
  • Send messages to other vertices.
  • Inspect or modify the value associated with its vertex/out-edges.
  • State updates are visible immediately; there are no data races on concurrent value access from different vertices.
• Limiting the graph state managed by the framework to a single value per vertex or edge simplifies the main computation cycle, graph distribution, and failure recovery.
Pregel C++ API (cont.)
• Message passing
  • No guaranteed order, but messages will be delivered, without duplication.
• Combiners
  • Combine several messages to reduce overhead
• Aggregators
  • Mechanism for global communication, monitoring, and data.
  • A number of predefined aggregators, such as min, max, or sum operations
• Topology mutation
  • Change the graph topology; resolve conflicts when individual vertices send conflicting messages.
Pregel C++ API (cont.)
• Input and output
Pregel implementation
• Designed for the Google cluster architecture
  • Each cluster consists of thousands of commodity PCs
• Persistent data
  • Stored in files on distributed storage systems such as GFS or Bigtable
• Temporary data
  • Stored as buffered messages on local disk.
• Divide graph vertices into partitions and assign them to different machines
  • Controllable by users; default method: hash
• In the absence of faults:
  • One master and many workers on a cluster of machines.
  • The master assigns load jobs and I/O and instructs workers on supersteps
• Fault tolerance:
  • Uses checkpoints; the master pings workers
  • Confined recovery (under development): outgoing messages are logged
Graph Application
PageRank
Shortest Path
Bipartite Matching
Semi Cluster
Pregel: Main Result
Reference (partial)
J. Leskovec, J. Kleinberg, C. Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD, 2005.
L. B. Holder, D. J. Cook and S. Djoko. Substructure Discovery in the SUBDUE System. PWKDD, 1994.
Yuanyuan Tian, Richard A. Hankins, Jignesh M. Patel. Efficient Aggregation for Graph Summarization. SIGMOD, 2008.
Saket Navlakha, Rajeev Rastogi, Nisheeth Shrivastava. Graph Summarization with Bounded Error. SIGMOD, 2008.
Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han. Mining graph patterns efficiently via randomized summaries. VLDB, 2009.
Dustin Bortner, Jiawei Han. Progressive Clustering of Networks using Structure-Connected Order of Traversal. ICDE, 2010.
U. Kang, Charalampos E. Tsourakakis, Christos Faloutsos. PEGASUS: A Peta-Scale Graph Mining System. ICDM, 2009.
K. Yoshida, H. Motoda, and N. Indurkhya. Graph based induction as a unified learning framework. Applied Intelligence, volume 4, 1994.
A. Inokuchi, T. Washio, and H. Motoda. Complete mining of frequent patterns from graphs: Mining graph data. Mach. Learn., 50(3):321–354, 2003.
M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM, pages 313–320, 2001.
X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM, 2002.
L. Dehaspe and H. Toivonen. WARMR: Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3(7–36), 1999.
S. Nijssen and J. Kok. FARMAR: Fast association rules for multiple relations. Data Mining and Knowledge Discovery, 3(7–36), 1999.
Part I (1.5 hrs)
Graph Mining Primer
Recent advances in Massive Graph Mining
Part 2(1.5 hrs)
CSV: cohesive subgraph Mining
Dngraph mining: a triangle based approach
CSV
1. Cohesive sub-graph mining, with visualization
2. Existing approaches
3. CSV provides effective visual solution
– Algorithm principle
– Connectivity Estimation
4. Experimental Study
Existing solutions
1. Current state-of-the-art to abstract information from huge graphs (information: yes; structure: no).
   1. Graph partition algorithms
      Spectral clustering [Ng01]: high computational cost
      METIS [Karypis96]: favors balanced patterns
   2. Graph pattern mining algorithms
      CODENSE [Hu05], CLAN [Zeng06]: exponential running time
2. Graph layout tools (information: no; structure: yes).
   Osprey [Breitkreutz03], VisANT [Mellor04]: do not have mining capability
We want structured information.
CSV: General Approach
• Separate vertices in the graph into VISITED, UNVISITED
• Start: Pick a vertex and add into VISITED
• Repeat until UNVISITED=empty
–Among all vertices that are in UNVISITED, pick one vertex V
most highly connected to VISITED
–Plot V’s connectivity
But how do we measure connectivity?
Connectivity measurement
Connectivity measurement is closely related to clique (fully connected subgraph) size.
The connectivity between two vertices in a graph (ηmax) is defined as the size of the biggest clique in the graph such that both are members of the clique.
The "connectivity" of a vertex (ζmax) is similarly defined as the size of the biggest clique it can participate in.
[Figure: example graphs on vertices a-e, annotated with ηmax(a, d) = 0, ηmax(a, c) = 4 and ζmax(a) = 5]
CSV: Step by Step
From graph to plot
[Figure: example graph on vertices A-J, the heap, and the plot of connectivity (1-4) against visited vertices; legend: unvisited, neighbors, visiting, visited]
Start from A and explore A's neighbor B.
Calculate ζmax(A) = 2 and output it.
CSV algorithm on a synthetic graph
From graph to plot
[Figure: the same graph, the heap, and the plot after A and B have been output]
Mark A visited; from B, explore B's immediate neighbors C, F, H.
Calculate ηmax(A, B) = 2 and output it.
CSV algorithm on a synthetic graph
From graph to plot
[Figure: the same graph, the heap, and the plot after A, B and C have been output]
Mark B visited and choose the closely connected C as the next visiting vertex. From C, explore C's immediate neighbors D, F, G, H and update ηmax when necessary.
Calculate ηmax(B, C) = 4 and output it.
CSV algorithm on a synthetic graph
From graph to plot
[Figure: final plot of connectivity (1-4) against the visiting order A B C H F G D E I J; the peak marks a cohesive subgraph]
Visit every vertex accordingly to produce a plot.
Peaks represent cohesive subgraphs.
Important Theorem
Connectivity computation is a hard problem
If graphs are massive, exact computation of connectivity is prohibitive.
Direct computation is costly
• The exact algorithm relies on clique detection (NP-hard).
• Even approximation is hard.
• Solution Part 1: Spatial Mapping
  • Pick k pivots
  • Map the graph into a k-dimensional space based on the vertices' shortest distances to the pivots
  • A clique will map into the same grid cell.
[Figure: example graph on vertices A-J and its mapping into a 2-dimensional grid with axes P0 and P1 (distances 0-3)]
Connectivity computation
• Solution Part 2: Approximate upper bound for ζmax(v) and ηmax(v, v')
  • Each vertex in a clique of size k must have
    • degree >= k-1
    • k-1 neighbors with degree >= k-1
  • For each vertex v, find its immediate neighbors in the same grid cell and construct a subgraph
Let us estimate ηmax(a, f)
Locate the immediate neighborhood of a and f, {a, b, c, d, e, f, g}. After sorting the degree array in descending order, we have
6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g).
Clique size = 5? = 6? = 7?
Experimental study on real datasets
DBLP: co-authorship graph.
DBLP: 2819 vertices, 54990 edges
[Figure: DBLP CSV plot, with two groups of German researchers marked]
Peaks in the DBLP CSV plot represent different research groups
SMD: Stock Market Data
[Figure: SMD CSV plot, annotated with a bridging vertex and two partial cliques]
Peaks in the SMD CSV plot represent highly cohesive stocks
DIP: Database of interacting proteins
[Figure: DIP CSV plot with connectivity values of 8 to 10. One peak contains SMD3, LSM8, PRP4, LSM2, PRP8, DCP1, PRP6, LSM6, LUC7, LSM3, SMX2, LSM4, SNP1, PAT1, STO1, LSM7, NAM8, LSM5, SNU71, PRP31, YHC1, PRP40, MUD1, SNU56. Another peak (connectivity 10) contains PFS2, RNA14, FIP1, REF2, CFT1, CFT2, MPE1, GLC7, PAP1, PTA1, YSH1, YTH1, PTI1]
Structure of a nucleotide-bound Clp1-Pcf11. Christian G. Noble, Barbara Beuth, and Ian A. Taylor. Nucleic Acids Res. 2007 January; 35(1): 87–99.
"CPF is also required in both the cleavage and polyadenylation reactions. It contains a core of eight subunits Cft1, Cft2, Ysh1, Pta1, Mpe1, Pfs2, Fip1 and Yth1"
Experimental Study
CSV as a pre-selection step
How?
• Apply CSV to identify potential cohesive subgraphs first.
• Run the exact algorithm CLAN on these candidates.
Result
• Gets the same exact cohesive subgraphs as running CLAN alone.
• Saves 28-84% of the time compared to running CLAN alone.
[Figure: CSV as a pre-selection method]
DNgraph mining: A triangle based approach
• Mining dense patterns out of an extremely large graph
  • When the graph is extremely large, even mining dense patterns is difficult.
• An iterative improvement mining approach is more desirable
  • Users are able to obtain the most updated results on demand.
• Dense patterns have a strong connection with triangles inside a graph.
• This has already been observed and explained by the preferential attachment property of large-scale graphs.
DNgraph mining: A triangle based approach
• What makes a pattern dense? Intuitively
  • A collection of vertices with high relevance.
  • They share a large number of common neighbors.
• With that we propose the definition of DNgraph
  • A DNgraph is the largest subgraph sharing the most neighbors.
  • Requires each connected vertex pair to share at least λ neighbors.
[Figure: graph G on vertices A-F and vertex A'; λ(G) = 3, λ(GA') = 0]
Compare DNgraph with other dense pattern definitions
• Two interesting patterns
  • A 4-clique and a Turan graph T(14, 4) [14 vertices, 4 groups, fully connected between groups]
• If mining quasi-cliques, we may end up discovering 1 pattern, as in (d)
• If searching for closed cliques, we may only find (e)
DNgraph mining: challenge
• Finding common neighbors for every pair of connected vertices is expensive
  • Requires O(E) join operations.
  • Needs random disk access.
  • In fact, finding a DN-graph is an NP-problem.
• Solution
  • Use the triangles that two vertices participate in to approximate the common neighbor size.
  • Iteratively refine the approximation following the locality of the graph's edges.
DNgraph mining: How
1. Initially: count the # of triangles each edge participates in.
  • Sort vertices and their neighbors in descending order of their degrees
  • Scan the graph to get the # of triangles for every vertex.
  • The # of triangles sets the initial value of λ.
2. Next, iteratively refine λ for every vertex
  • Using streams of triangles.
  • Iteratively refine λcur.
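Triangle Counting: how?
1. Sort vertices and their neighbors in descending order of their degrees
[Figure: example graph on vertices a-h]
Adjacency lists before sorting:
  a: b d e
  b: a c d e
  c: b d e
  d: a c e g h
  e: a b c d g f
  f: e g
  g: d e f
  h: d
Adjacency lists after sorting:
  e: d b a c g f
  d: e a c g h
  b: e d a c
  a: e d b
  c: e d b
  g: e d f
  f: e g
  h: d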
Triangle counting (cont.)
1. Sort vertices and their neighbors in descending order of their degrees
2. Join neighborhoods to get the triangle count for every edge
• The two vertices exhibit locality, due to reordering and the preferential attachment property of large graphs
[Figure: example graph on vertices a-h with triangle counts of 3 marked on two edges, shown next to the sorted adjacency lists]
Triangle counting (cont.)
1. Sort vertices and their neighbors in descending order of their degrees
2. Join neighborhoods to get the triangle count for every edge
3. Use that as the initial λ value for every edge/vertex
• A vertex's λ value is the maximal λ value of the edges it participates in
• λcur(e) = 3
[Table: vertex / λcur: e 3, d 3, ...]
DNgraph mining: How (cont.)
• Initially: count the # of triangles each edge participates in.
• Next, iteratively refine λ for every vertex
  • Using streams of triangles.
  • Iteratively refine λcur.
Triangle stream
• Follow the same order of visiting the graph as during triangle counting
• Triangles are not materialized, saving storage
[Figure: stream of triangles on edge (a, b) with third vertices n1, n2, ..., nx; lambda = k]
Iteratively refine λ
• For every vertex v, when its triangles come, bound λcur(v) using the two other vertices' λcur
Iteratively refine λ (cont.)
• Initially: count the # of triangles each edge participates in.
• Next, iteratively refine λ for every vertex
  • Using streams of triangles.
  • Iteratively refine λcur.
• Until the λcur of all vertices have converged
[Figure: example graph on vertices a-h; table of vertex / λcur: e 3, b 3, ...]
DNgraph: Experiment
• Large-scale graph
  • Flickr dataset with 1,715,255 vertices and 22,613,982 edges.
  • 1 iteration requires 1 hour on a workstation with a Quad-Core AMD Opteron(tm) processor 8356, 128GB RAM and a 700GB hard disk.
  • Converges in 66 iterations; almost stable after 35 iterations
• Abstraction
  • Within the triangulation algorithm. The abstraction ensures our approach's extensibility to different input settings.
• Iteratively refine results
  • The estimation of the common neighborhood improves with every iteration; users are able to obtain the most updated results on demand.
• Pre-collection of statistics to support effective buffer management
• The process can easily be mapped to key->value pairs for further distributed processing.
Reference (partial)
[Hu05] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 21(1):213-221, 2005.
[Ng01] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, volume 14, 2001.
[Karypis96] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), page 35, Washington, DC, USA, 1996. IEEE Computer Society.
[Breitkreutz03] B. J. Breitkreutz, C. Stark, and M. Tyers. Osprey: a network visualization system. Genome Biology, 4, 2003.
[Mellor04] Z. Hu, J. Mellor, J. Wu, and C. DeLisi. VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics, 5:17-24, 2004.
[Zeng06] J. Wang, Z. Zeng, and L. Zhou. CLAN: An algorithm for mining closed cliques from large dense graph databases. Proceedings of the International Conference on Data Engineering, page 73, 2006.
[Turan41] P. Turan. On an extremal problem in graph theory. Mat. Fiz. Lapok, 48:436–452, 1941.
[Ankerst99] M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 49-60, Philadelphia, PA, June 1999.
[DNgraph10] On Triangle based DNgraph Mining. NUS technical report TRB4/10.
```
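The superedge cost rule and the reconstruction steps from the "Compress (cont.)" and "Reconstruct" slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `superedge_decision` and `reconstruct`, the use of `frozenset` pairs for undirected edges, and the `("+", edge)` / `("-", edge)` encoding of corrections are all assumptions made for this example.

```python
from itertools import product


def superedge_decision(nodes_u, nodes_v, graph_edges):
    """Decide whether to keep superedge (u, v), following the slide's cost rule.

    Returns (use_superedge, cost, corrections). Edges are frozensets {a, b};
    corrections are ("+", edge) or ("-", edge) pairs (hypothetical encoding).
    """
    # pi_uv: all node pairs between the two supernodes
    pi_uv = {frozenset(p) for p in product(nodes_u, nodes_v) if p[0] != p[1]}
    # E_uv: the pairs that are actual edges of G
    e_uv = pi_uv & graph_edges
    cost_with = 1 + len(pi_uv - e_uv)   # 1 superedge + one "-" correction per missing pair
    cost_without = len(e_uv)            # one "+" correction per actual edge
    if cost_with < cost_without:
        return True, cost_with, {("-", e) for e in pi_uv - e_uv}
    return False, cost_without, {("+", e) for e in e_uv}


def reconstruct(supernodes, superedges, corrections):
    """Rebuild the edge set: expand superedges, then apply the corrections."""
    edges = set()
    for u, v in superedges:
        edges |= {frozenset(p) for p in product(supernodes[u], supernodes[v])
                  if p[0] != p[1]}
    for sign, edge in corrections:
        if sign == "+":
            edges.add(edge)
        else:
            edges.discard(edge)
    return edges


# Example from the slides: X = {d,e,f,g}, Y = {a,b,c}, every X-Y pair except (a,d)
X, Y = {"d", "e", "f", "g"}, {"a", "b", "c"}
G = {frozenset(p) for p in product(X, Y)} - {frozenset(("a", "d"))}
use, cost, corr = superedge_decision(X, Y, G)
rebuilt = reconstruct({"X": X, "Y": Y}, [("X", "Y")] if use else [], corr)
```

For this pair of supernodes the superedge costs 2 (one superedge plus the correction -(a,d)) against 11 without it, so the superedge is kept. The slide's total of 5 also counts the three "+" corrections for edges to h, i and j, which are outside this pair.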
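To make the three GIM-V steps concrete, the sketch below runs PageRank with the combine2 and combineAll definitions given on the "Generalize ×G" slide. It is a single-machine illustration under stated assumptions, not the PEGASUS Hadoop code: the names `gim_v` and `pagerank`, the adjacency-dict input and the convergence tolerance are hypothetical, and every node is assumed to have at least one out-edge.

```python
def gim_v(matrix_entries, v, combine2, combine_all, assign):
    """One application of the xG operator.

    matrix_entries: list of (i, j, m_ij); v: dict mapping node -> value.
    """
    partial = {i: [] for i in v}
    for i, j, m_ij in matrix_entries:
        partial[i].append(combine2(m_ij, v[j]))        # combine2 per matrix entry
    return {i: assign(v[i], combine_all(partial[i]))   # combineAll, then assign
            for i in v}


def pagerank(adj, c=0.85, tol=1e-10, max_iter=1000):
    """PageRank via GIM-V; adj maps each node to its list of out-neighbours."""
    n = len(adj)
    # M = E^T with E row-normalised: entry (dst, src) = 1 / outdeg(src)
    entries = [(dst, src, 1.0 / len(outs))
               for src, outs in adj.items() for dst in outs]
    p = {u: 1.0 / n for u in adj}
    for _ in range(max_iter):
        new_p = gim_v(
            entries, p,
            combine2=lambda m_ij, v_j: c * m_ij * v_j,
            combine_all=lambda xs: (1 - c) / n + sum(xs),
            assign=lambda old, new: new,
        )
        if max(abs(new_p[u] - p[u]) for u in adj) < tol:
            return new_p
        p = new_p
    return p


cycle_ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
small_ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Swapping in different `combine2`, `combine_all` and `assign` functions is what lets the same loop express the other algorithms the slides list (random walk with restart, diameter estimation, connected components).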
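The degree-array estimate on the "Connectivity computation" slide (is the clique size 5, 6 or 7?) can be read as: a clique of size k needs at least k vertices of degree k-1 or more. The sketch below implements that reading only; the slides do not spell out the exact procedure, so the function name `clique_size_upper_bound` and the stopping rule are assumptions.

```python
def clique_size_upper_bound(degrees):
    """Largest k such that at least k vertices have degree >= k - 1."""
    ordered = sorted(degrees, reverse=True)
    k = 0
    for rank, degree in enumerate(ordered, start=1):
        # The rank-th largest degree must support a clique of size `rank`
        if degree >= rank - 1:
            k = rank
        else:
            break
    return k


# Degree array from the slide: 6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g)
bound = clique_size_upper_bound([6, 6, 5, 4, 4, 4, 3])
```

With the slide's degree array, sizes 7 and 6 are ruled out (only two vertices have degree 6 and only three have degree 5 or more), while six vertices have degree 4 or more, so the bound is 5. This is an upper bound only; it does not show that a clique of that size exists.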
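The initial step of DNgraph mining (count the triangles each edge participates in, then give each vertex the maximum over its edges as its starting λ) can be sketched in memory as below. This is an illustration under assumptions, not the disk-based method of the slides: the names `edge_triangle_counts` and `initial_lambda` are hypothetical, and the example graph (a 4-clique with one pendant vertex) is made up for the sketch.

```python
def edge_triangle_counts(adj):
    """Triangle count per edge; adj maps each vertex to its set of neighbours.

    Vertices and their neighbours are visited in descending order of degree,
    as on the slides; the count for edge (u, v) is the number of common neighbours.
    """
    def by_degree(w):
        return (-len(adj[w]), w)

    counts = {}
    for u in sorted(adj, key=by_degree):
        for v in sorted(adj[u], key=by_degree):
            edge = frozenset((u, v))
            if edge not in counts:
                counts[edge] = len(adj[u] & adj[v])  # join the two neighbourhoods
    return counts


def initial_lambda(adj):
    """Initial lambda of a vertex: the maximum triangle count over its edges."""
    counts = edge_triangle_counts(adj)
    return {u: max((counts[frozenset((u, v))] for v in adj[u]), default=0)
            for u in adj}


# Hypothetical example: 4-clique on a, b, c, d with pendant vertex e attached to d
example = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}
triangle_counts = edge_triangle_counts(example)
lambdas = initial_lambda(example)
```

Here every edge of the 4-clique lies in two triangles and the pendant edge in none, so the clique vertices start with λ = 2 and e starts with λ = 0. The iterative refinement over triangle streams described on the later slides is not covered by this sketch.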