Indexing for Inexact matching

Exact and Inexact Graph Matching with applications in Biology
Bioinformatica
27-05-2011
Network comparison – Lipari International Summer School – July 3-10, 2010
BIBLIOGRAPHY
Di Natale R, Ferro A, Giugno R, Mongiovì M, Pulvirenti A, Shasha D. SING: Subgraph search In Non-homogeneous Graphs. BMC Bioinformatics, vol. 11:96, 2010.
Mongiovì M, Di Natale R, Giugno R, Pulvirenti A, Ferro A, Sharan R. A set-cover-based approach for inexact graph matching. Journal of Bioinformatics and Computational Biology, vol. 8, 199-218, 2010.
Optimization, Machine Learning and Bioinformatics – Centre "Ettore Majorana" – Erice – September 8-16, 2010
Outline
• Motivation
• Exact matching and Graph Indexing
• Indexing large graphs
• Indexing for inexact matching
• A Set-cover based approach
• Multiset multi-cover and a greedy algorithm
• A tight lower bound for the optimal cover
• Experimental analysis
• Application on protein complexes
• Conclusion and future work
Searching on molecular compounds
[Figure: a query molecular substructure (atoms C, N, O, H) and its matches in a database of molecular compounds]
Searching on protein complexes
Query a complex of a
species over a database of
complexes of another
species
Exact Graph Matching
Given two graphs $G_1 = (V_1, E_1, \Sigma, l)$ and $G_2 = (V_2, E_2, \Sigma, l)$, where $\Sigma$ is the label alphabet and $l$ the labeling function, an isomorphism (that respects the labels) between $G_1$ and $G_2$ is a bijection $\varphi : V_1 \to V_2$ such that:
• $(v, u) \in E_1 \Leftrightarrow (\varphi(v), \varphi(u)) \in E_2$
• $l(u) = l(\varphi(u)), \ \forall u \in V_1$
A subgraph isomorphism between $G_1$ and $G_2$ is an isomorphism between $G_1$ and a subgraph of $G_2$.
We say that a graph $G_1$ admits an exact match in $G_2$ if there exists a subgraph isomorphism between $G_1$ and $G_2$.
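The definition above can be illustrated with a minimal backtracking sketch. It is not one of the algorithms named in these slides (Ullmann, VF2); it is a plain brute-force search, and the function name, the dictionary-based graph encoding and the two toy graphs are assumptions made for illustration. Graphs are assumed undirected, and a match only requires every query edge to map to a target edge (the target subgraph need not be induced).

```python
def find_subgraph_match(q_labels, q_edges, g_labels, g_edges):
    """Return a dict mapping query vertices to target vertices, or None.

    q_labels / g_labels: dict vertex -> label
    q_edges / g_edges: iterable of undirected edges (u, v)
    """
    g_adj = {v: set() for v in g_labels}
    for u, v in g_edges:
        g_adj[u].add(v)
        g_adj[v].add(u)
    q_adj = {v: set() for v in q_labels}
    for u, v in q_edges:
        q_adj[u].add(v)
        q_adj[v].add(u)

    order = list(q_labels)
    mapping = {}

    def extend(i):
        if i == len(order):
            return True
        u = order[i]
        for cand in g_labels:
            # the mapping must be injective and must respect the labels
            if cand in mapping.values() or g_labels[cand] != q_labels[u]:
                continue
            # every already-mapped neighbour of u must be adjacent to cand
            if all(mapping[w] in g_adj[cand] for w in q_adj[u] if w in mapping):
                mapping[u] = cand
                if extend(i + 1):
                    return True
                del mapping[u]
        return False

    return dict(mapping) if extend(0) else None


# Hypothetical toy data: the query is A linked to both B and C.
query = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2)])
g1 = ({0: "A", 1: "B", 2: "C", 3: "A"}, [(0, 1), (0, 2), (2, 3)])
g2 = ({0: "A", 1: "B", 2: "A", 3: "C"}, [(0, 1), (2, 3)])

match_g1 = find_subgraph_match(*query, *g1)
match_g2 = find_subgraph_match(*query, *g2)
```

In this toy data `g1` contains the query while `g2` does not, because in `g2` the edges AB and AC hang from two different A vertices.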
Subgraph Isomorphism
The subgraph isomorphism problem is NP-hard. Several algorithms (Ullmann, Nauty, VF2) and tools (NetMatch) have been proposed.
Searching for a query in a database of graphs may therefore take a long time. For this reason, indexing systems have recently been proposed to obtain a reasonable response time.
Graph Indexing Systems
Feature-based graph indexing systems: they consider a set of "features" F and filter out every database graph that lacks at least one feature of F contained in the query. They use an inverted index to organize the features.
E.g.: gIndex, TreePi, GraphFind
Non-feature-based graph indexing systems: the graphs of the database are usually arranged in a tree (R-tree or B-tree like). These systems are more suitable for frequent updates.
E.g.: CTree, GCoding
Features
Each system defines its own set of features. Some examples of features are:
• Small graphs (gIndex, FGIndex): to limit the number of features, they consider the set of frequent subgraphs.
• Trees (TreePi): since trees have a center, it is possible to improve the filtering phase by considering the distances between centers.
• Paths (SING): paths have a starting point. This information can be used to improve filtering and matching. Moreover, finding paths is more efficient than finding subgraphs.
Example
Consider as features all paths of length 2
[Figure: a graph G and a query Q with FG, the set of feature occurrences of G, and FQ, the set of features of Q; the comparison highlights missing features and missing occurrences]
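As a rough illustration of path features and of the difference between a missing feature and a missing occurrence, the sketch below counts the label sequences of all simple paths with two edges. The function name, the graph encoding and the toy graphs are assumptions; each path is counted once per direction, since a path is read from its starting vertex.

```python
from collections import Counter


def path_features(labels, edges, length=2):
    """Count the label sequences of simple paths with `length` edges."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = Counter()

    def walk(path):
        if len(path) == length + 1:
            counts["".join(labels[v] for v in path)] += 1
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    for start in labels:
        walk([start])
    return counts


# Hypothetical toy data: Q is a star with centre A and three B leaves,
# G is the same star with only two B leaves.
fq = path_features({0: "A", 1: "B", 2: "B", 3: "B"}, [(0, 1), (0, 2), (0, 3)])
fg = path_features({0: "A", 1: "B", 2: "B"}, [(0, 1), (0, 2)])

missing_occurrences = fq - fg          # multiset difference, keeps positive counts
missing_features = set(fq) - set(fg)   # features absent from G altogether
```

Here G contains every feature of Q (no missing features) but not often enough: four occurrences of BAB are missing.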
Graph Indexing Schema
The basic scheme considers three phases:
1. Preprocessing: each graph of the database is examined in order to extract all the features it contains (subgraphs, trees or paths). The features are organized in an inverted index.
2. Filtering: the query is examined in order to extract the set of features it contains, and a candidate graph set is computed by comparing the set of features of the query with the sets of features of the graphs.
3. Matching: each candidate graph is examined in order to verify whether it contains matches.
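The first two phases can be sketched as follows, under simplifying assumptions: features are single edges described by their (sorted) pair of endpoint labels, and the function names and toy database are hypothetical. The matching phase would then run a subgraph isomorphism test on each surviving candidate.

```python
from collections import defaultdict


def edge_features(labels, edges):
    """Feature set of a graph: the label pairs of its edges."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}


def build_index(database):
    """Preprocessing: inverted index feature -> ids of the graphs containing it."""
    index = defaultdict(set)
    for gid, (labels, edges) in database.items():
        for feat in edge_features(labels, edges):
            index[feat].add(gid)
    return index


def filter_candidates(index, database, query):
    """Filtering: keep the graphs that contain every feature of the query."""
    candidates = set(database)
    for feat in edge_features(*query):
        candidates &= index.get(feat, set())
    return candidates


# Hypothetical toy database and query.
database = {
    "g1": ({0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2)]),
    "g2": ({0: "A", 1: "B", 2: "A", 3: "C"}, [(0, 1), (2, 3)]),
    "g3": ({0: "A", 1: "B"}, [(0, 1)]),
}
query = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2)])

index = build_index(database)
candidates = filter_candidates(index, database, query)
```

In this example `g3` is pruned, while `g2` survives filtering even though it does not contain the query; such false positives are what the matching phase has to eliminate.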
Example
[Figure: preprocessing turns the graph DB into an index that maps each feature (f1-f6) to the graphs containing it (g1-g6); filtering the features of the query Q against the index yields the set of candidates]
SING
Consider edges as features. Note that AB and AC are contained in both g1 and g2, but only g1 contains the query q.
How can we distinguish these cases?
In g1 and in q, both features AB and AC start from the same vertex A; in g2 they start from two different vertices.
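A minimal sketch of this locality idea, assuming edges as features and a hypothetical encoding of the three graphs: for every vertex we record the features that start from it, and a graph passes the filter only if each query vertex finds a target vertex whose starting features include its own.

```python
def features_from_vertex(labels, edges):
    """Map each vertex to the edge features (own label, neighbour label) starting at it."""
    start = {v: set() for v in labels}
    for u, v in edges:
        start[u].add((labels[u], labels[v]))
        start[v].add((labels[v], labels[u]))
    return start


def passes_locality_filter(query, graph):
    """True if every query vertex has at least one compatible target vertex."""
    q_start = features_from_vertex(*query)
    g_start = features_from_vertex(*graph)
    return all(
        any(q_feats <= g_feats for g_feats in g_start.values())
        for q_feats in q_start.values()
    )


# Hypothetical encodings of q, g1 and g2.
q = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2)])
g1 = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (0, 2)])
g2 = ({0: "A", 1: "B", 2: "A", 3: "C"}, [(0, 1), (2, 3)])

keep_g1 = passes_locality_filter(q, g1)
keep_g2 = passes_locality_filter(q, g2)
```

With these inputs `g2` is discarded without running any subgraph isomorphism test, because no single A vertex of `g2` starts both AB and AC.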
SING index
We consider as features all the paths of length up to lp (by default lp = 4)
We consider a global inverted index and a local index for each graph
[Figure: the global index maps each feature (f1-f6) to the graphs that contain it, with occurrence counts; the local index of g1 maps each of its features to a bit array marking the vertices from which the feature starts (e.g. 10010000)]
Query processing
1. For each feature f of the query, take the set of graphs in which f occurs at least as many times as in the query. Compute the intersection of all these sets.
2. For each graph of the resulting set, use the local index to compute a mapping between vertices of the query and vertices of the graph.
3. Discard all graphs in which at least one vertex of the query has no corresponding vertex.
4. Assign new labels to the vertices based on the mapping. The new labels make the verification phase faster.
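Step 1 can be sketched as an intersection over a count-aware global index. The index layout (feature -> {graph id: occurrences}), the function name and all the numbers below are illustrative assumptions, not values taken from the slides.

```python
from collections import Counter


def count_filter(global_index, query_counts, all_graphs):
    """Keep the graphs in which every query feature occurs often enough."""
    candidates = set(all_graphs)
    for feat, needed in query_counts.items():
        postings = global_index.get(feat, {})
        candidates &= {g for g, n in postings.items() if n >= needed}
    return candidates


# Hypothetical global index: feature -> {graph id: number of occurrences}.
global_index = {
    "f1": {"g1": 3, "g2": 1},
    "f2": {"g1": 2, "g3": 5},
}
query_counts = Counter({"f1": 2, "f2": 1})

survivors = count_filter(global_index, query_counts, {"g1", "g2", "g3"})
```

Only `g1` survives: `g2` has too few occurrences of `f1`, and `g3` does not contain `f1` at all. The per-vertex steps 2 to 4 would then work on the local index of each survivor.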
Comparison – molecules
(AIDS dataset)
Comparison – TRN E. coli annotated with gene expression data
• 22 copies of the Transcription Regulation Network (TRN) of E. coli
• Gene expression profiles of 22 strains of E. coli K12
• Each network is labeled with the gene expression profile of a different sample. 5 labels: very low, low, medium, high, very high.
• Motifs (by Uri Alon) as queries
Comparison – Single graph
(synthetic)
• Scale-free network
• 2000 nodes
• 4000 edges
• 8 labels
• Queries extracted at random
The importance of inexact matching
In certain application domains, exact matching is too restrictive because it misses partial matches, which can give useful information. In these cases, inexact matching is greatly advantageous.
E.g. molecular compounds: partially matching substructures can preserve important chemical properties.
E.g. protein complexes: we want to look for a protein complex of one species in a database of protein complexes of another species, in order to identify conserved complexes. The topology is rarely fully conserved.
Indexing for Inexact matching
GRAFIL: transforms the edge deletions into feature misses and computes the maximum number of feature misses allowed. To improve the results, it applies a multi-filter strategy that considers several groups of features separately.
SIGMA: given a maximum number of edge deletions, it transforms the filtering problem into a variant of Set-cover.
SAGA: handles deletions and mismatches. It compares fragments (groups of nodes satisfying a maximum distance constraint) of the query with fragments of each target graph and builds a compatibility graph among matching fragments. A clique in the compatibility graph is a candidate match. SAGA uses a different concept of distance between graphs, so its applicability is limited in domains that require control over the number of deletions.
CTree: finds the subgraphs whose edit distance from the query is low. The distance computation is approximated, so it can produce false negatives.
Inexact matching – edge deletions
[Figure: a query Q and a graph G that matches it after edge deletions]
• Some edges of the query may be missing from the graph (deletions)
• Grafil and SIGMA fix a maximum number of deletions d and look for all matches obtained by deleting from the query a number of edges less than or equal to d
Managing edge deletions
• Each edge is associated with the set of features that contain it.
• GRAFIL → How many features of Q can be missing in a target graph? → Maximum coverage problem
• SIGMA → Given the set of features of a target graph, is it "consistent" with Q and a maximum number of deletions d? → Multiset multi-cover problem
[Figure: query Q with edges 1-4 and the associated feature sets F1-F4]
Feature count vs identity
[Figure: graphs G and Q with vertex labels A and B. Feature counts: in Q, ABA occurs 3 times and AAB 3 times; in G, ABA occurs 3 times and AAB once]
• Search for Q with 1 allowed edge deletion
• The maximum number of feature misses is 3 (considering all the occurrences)
• G has 2 feature misses, so it cannot be discarded
• If we look at the identity of the features, we note that G misses 2 features of kind AAB, which is sufficient to assert that Q cannot be contained in G
SIGMA- admitting one deletion
[Figure: query Q with edges 1-4, its feature set F, and the feature sets F1 and F2 associated with edges 1 and 2]
• Given a graph G, if Q is completely contained in G, all features of F must be contained in G.
• If edge 1 is missing, the features in $F_1$ can be missing in G
• If edge 2 is missing, the features in $F_2$ can be missing in G, and so on…
• In general, if we admit at most one deletion, all features of $F - F_i$ must be contained in G for some $i \in E$
• The missing features in G must be contained in $F_i$ for some $i \in E$
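The one-deletion condition reduces to a subset test, which might be sketched like this. Features are treated as plain sets here (multiplicities are ignored), and the function name and the toy feature sets are assumptions for illustration.

```python
def candidate_with_one_deletion(query_features, per_edge_features, graph_features):
    """True if the features missing from the graph fit inside a single F_i."""
    missing = query_features - graph_features
    if not missing:
        return True  # nothing is missing, so zero deletions suffice
    return any(missing <= f_i for f_i in per_edge_features.values())


# Hypothetical query A-B-C with edges 1 (A-B) and 2 (B-C).
F = {"AB", "BC", "ABC"}
F_i = {1: {"AB", "ABC"}, 2: {"BC", "ABC"}}

keeps_ab_only = candidate_with_one_deletion(F, F_i, {"AB"})
keeps_nothing = candidate_with_one_deletion(F, F_i, set())
```

A graph containing only AB remains a candidate (its missing features all lie in the set of edge 2), whereas a graph with none of the features would need more than one deletion and is discarded.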
Generalizing to more deletions
Given a graph G, find the minimum-size set of edges $E'$ such that: $F_Q - F_G \subseteq \bigcup_{e \in E'} F_e$
[Figure: query Q with edges 1-4, graph G, the feature set FG, and the missing features FQ-FG covered by the edge feature sets F1-F4]
• This corresponds to finding the minimum number of edges that have to be deleted for G to be a candidate match
• The problem defined this way is the classical Set-cover problem
• Since a feature can occur several times, we consider instead the Multiset multi-cover problem, with the further constraint that a set can be taken only once (Vazirani)
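To make the inputs of the cover problem concrete, the sketch below builds, for a small query, the multiset of features that traverse each edge and the multiset of missing features $F_Q - F_G$. Path features of up to two edges, the function name and the toy graphs are assumptions chosen to keep the example short.

```python
from collections import Counter


def edge_feature_sets(labels, edges, max_len=2):
    """Return (all path features, per-edge features) as multisets of label strings."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    per_edge = {frozenset(e): Counter() for e in edges}
    total = Counter()

    def walk(path):
        if len(path) > 1:
            feat = "".join(labels[v] for v in path)
            total[feat] += 1
            # the occurrence is charged to every edge it traverses
            for a, b in zip(path, path[1:]):
                per_edge[frozenset((a, b))][feat] += 1
        if len(path) <= max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in labels:
        walk([start])
    return total, per_edge


# Hypothetical query A-B-C and target A-B.
fq, f_e = edge_feature_sets({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)])
fg, _ = edge_feature_sets({0: "A", 1: "B"}, [(0, 1)])

missing = fq - fg  # the multiset Y that has to be covered
```

In this example the missing features coincide with the feature multiset of the edge B-C, so deleting that single edge is enough to make the target a candidate.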
Multiset multi-cover
[Figure: a multiset Y to cover and a family S of multisets X1-X5]
• We have multisets (each element has a multiplicity)
• Find the minimum-size subfamily of S whose union contains Y (respecting the multiplicities)
• E.g. {X2, X3, X4} is a cover for Y
Multiset multi-cover
• Multiset multi-cover, like Set-cover, is NP-hard, but…
• There is a greedy algorithm that computes an approximate solution in polynomial time with bounded error
• We can compute a lower bound for the size of the
cover, which we can use to prune the database of
graphs. For the filtering to be effective we need a
tight lower bound.
• Given a graph G, if the computed lower bound for
the cover is greater than the maximum number of
allowed deletions then G can be discarded
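A greedy strategy for this problem can be sketched as follows: repeatedly take the unused set that covers the largest number of still-uncovered occurrences. This is a generic greedy sketch, assuming that multiplicities of the chosen sets add up and that each set may be used at most once; the function name and the toy family are hypothetical, and the per-occurrence cost bookkeeping used for the lower bound is omitted.

```python
from collections import Counter


def greedy_multicover(family, target):
    """Greedy cover of multiset `target` by multisets in `family`, each used at most once.

    Returns the list of chosen set names, or None if no cover exists.
    """
    residual = Counter(target)
    remaining = dict(family)
    chosen = []
    while sum(residual.values()) > 0:
        # gain = number of still-uncovered occurrences a set would cover
        def gain(name):
            return sum(min(m, residual[f]) for f, m in remaining[name].items())

        best = max(remaining, key=gain, default=None)
        if best is None or gain(best) == 0:
            return None
        residual -= remaining.pop(best)  # in-place subtraction keeps positive counts only
        chosen.append(best)
    return chosen


# Hypothetical instance.
Y = Counter({"a": 2, "b": 1})
S = {
    "X1": Counter({"a": 1}),
    "X2": Counter({"a": 1, "b": 1}),
    "X3": Counter({"a": 1}),
}

cover = greedy_multicover(S, Y)
no_cover = greedy_multicover({"X1": Counter({"a": 1})}, Counter({"a": 2}))
```

The size of the greedy cover is an upper bound on the optimal cover size, which is why a separate lower bound is needed before a graph can be safely discarded.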
A tight lower bound
• Y is the multiset to cover and S is the input family of multisets
• When $X \in S$ is taken, assign a cost to each element instance of X, spreading a unit cost over all the newly covered feature occurrences
• Consider the occurrences of each feature numbered in the order in which they are covered, and let cost(f, i) be the cost assigned to the i-th occurrence of f.
• Let […]* be the exact cover, $m_X(f)$ and $m_Y(f)$ the multiplicity of f in X and Y, and $r_X(f) = \min(m_X(f), m_Y(f))$
Lower bound proof
Proof. We prove that: [equation not preserved in the source]
The thesis follows immediately, since […]* is one of the […]' $\subseteq S$ that satisfy the condition under the min operator
Computing the lower bound
• During the execution of the greedy algorithm, we compute […] and, for each set X, the quantity $\sum_{f \in X} r_X(f)\,[\dots](f)$.
• The minimum-size […]' is obtained by taking the sets which have the greatest values of $\sum_{f \in X} r_X(f)\,[\dots](f)$
• More precisely, the sets of S are ranked by $\sum_{f \in X} r_X(f)\,[\dots](f)$ in descending order, then they are taken one by one until the total is greater than or equal to |[…]| + […]
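The rank-and-accumulate procedure can be illustrated with a simpler, unweighted counting bound. This is not the cost-weighted bound of these slides: it drops the per-feature weights and only uses $r_X(f) = \min(m_X(f), m_Y(f))$, so it is generally weaker. It is still valid, because any cover must supply at least $|Y|$ usable occurrences in total. The function name and the toy instance are assumptions.

```python
from collections import Counter


def counting_lower_bound(family, target):
    """Smallest k such that the k most useful sets could jointly supply |target| occurrences.

    Returns None when even the whole family cannot cover the target.
    """
    need = sum(target.values())
    # r_X(f) = min(m_X(f), m_Y(f)): occurrences of f in X that can actually be used
    capacity = sorted(
        (sum(min(m, target[f]) for f, m in x.items()) for x in family.values()),
        reverse=True,
    )
    covered, k = 0, 0
    for c in capacity:
        if covered >= need:
            break
        covered += c
        k += 1
    return k if covered >= need else None


# Hypothetical instance.
Y = Counter({"a": 2, "b": 1})
S = {
    "X1": Counter({"a": 1}),
    "X2": Counter({"a": 1, "b": 1}),
    "X3": Counter({"a": 1}),
}

bound = counting_lower_bound(S, Y)
```

With a maximum of one allowed deletion, a bound of 2 would be enough to discard the corresponding graph without running the matching phase.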
Query processing
1. Extract the features from the query.
2. Build a family S of feature sets (one set for each edge of the query)
3. For each graph:
a) Compute the set of missing features Y
b) Apply the greedy algorithm for multiset multi-cover on (S, Y)
c) Compute the lower bound
d) If the lower bound is less than or equal to the maximum number of allowed deletions, check whether there is a match
e) Otherwise discard the graph
Experimental analysis - molecules
• Comparison of our approach (SIGMA) against GRAFIL and a naive approach (Edge), over a database of 40,000 molecular compounds
• All methods use paths of length up to 4 as features
Experimental analysis – query time
Application on protein complexes
Protein complexes cross-comparison (human vs. yeast)
Find all yeast protein complexes that contain a human protein complex with up to 4 deletions
Material
• 785 Human complexes from CORUM
• 284 Yeast complexes from SGD
• The topology was inferred from the PPI
networks (BioGRID)
• The vertices were labeled according to the BLAST score (similar proteins are assigned the same label):
All-pair BLAST on yeast and human proteins
Average-linkage hierarchical clustering with score cutoff 40 and a maximum size of 100
Proteins in the same cluster receive the same label
Experimental analysis - complexes
Experimental analysis - complexes
LSm2-8 complex
Small nucleolar
ribonucleoprotein complex
Conclusion
Exact matching → SING
Use node locality information to improve filtering
Identify and filter nodes of the target network that
cannot belong to a match
Reassign labels to improve the matching phase
Inexact matching → SIGMA
Efficient filtering based on Multiset multi-cover
Greedy algorithm
A tight lower bound for the optimal cover
Applications
Molecular compounds
Transcription Regulation Networks
Protein complexes
Future directions
• Multi-label management
Support generic associations between query
nodes and target nodes (e.g. all-pair-BLAST)
Support labels that have a hierarchical
structure (e.g. GO)
Manage wildcards
• Managing bounded and unbounded paths
Distance and reachability queries with label
constraints
• Inexact matching on large graphs
Methods for exact matching do not work well
Manage matches sharing a large common
component
Future directions
• Find high scored matches (with respect to a
scoring function)
Edge weights
Node similarity
• Secondary memory management
The Jacob T. Schwartz International School
for Scientific Research
(LIPARI SCHOOL)
http://lipari.cs.unict.it/
School Director
Professor Alfredo Ferro, Ph.D.
Department of Mathematics & Computer Science
University of Catania
Viale A.Doria, 6 - 95125 Catania - ITALY
Tel: +39 095 7383071
Fax: +39 095 330094
E-mail: ferro@dmi.unict.it
Jacob T. Schwartz International School for Scientific Research
Biological Sequence Analysis and
High Throughput Technologies
Lipari July 2 – July 9, 2011
Speakers
Soren Brunak, Center for Biological Sequence Analysis, Technical University of Denmark
Bud Mishra, New York University
Itzik Peer, Columbia University in the City of New York
Paola Sebastiani, Boston University
Guest Lecturers
Carlo Croce, Ohio State University
Gene Myers, HHMI
Roded Sharan, Tel Aviv University
School Directors
* Prof. Alfredo Ferro (University of Catania)
* Prof. Raffaele Giancarlo (University of Palermo)
* Prof. Concettina Guerra (University of Padova and Georgia Tech.)
* Prof. Michael Levitt, (Stanford University)
* Dr. Rosalba Giugno (co-director, University of Catania)
* Dr. Alfredo Pulvirenti (co-director, University of Catania)
Jacob T. Schwartz International School for Scientific Research
Game Theoretic approach to Computational Complex
Systems
Lipari July 9 – July 16, 2011
Doyne Farmer, Santa Fe Institute – LUISS Rome
The complex dynamics of complicated games
Herbert Gintis, Santa Fe Institute - Central European University - Collegium
Budapest
The Dynamics of Market Economies
Dirk Helbing, ETH Zurich, Swiss Federal Institute of Technology Zurich
Social cooperation, norms and conflicts: A game-theoretical approach
Tim Roughgarden, Stanford University
Karl Sigmund, University of Vienna
Reward and punishment in Public good Games.
School Directors
* Prof. Alfredo Ferro (University of Catania)
* Prof. Dirk Helbing (ETH Zurich)
* Prof. Andrea Rapisarda (University of Catania)
* Prof. V.S. Subrahmanian (University of Maryland)
4th International Conference on Similarity Search and
Applications
Lipari June 30 – July 1, 2011
Invited Speakers
Roded Sharan, Tel Aviv University
Paolo Ferragina, Università di Pisa
http://www.sisap.org/
THANK YOU!
http://ferrolab.dmi.unict.it/