A tool for evaluating strategies for grouping of biological data
Vaida Jakonienė, Patrick Lambrix

Outline
- Motivation
- Method for similarity-based grouping
- KitEGA – illustration
- Summary and future work
V. Jakonienė, P. Lambrix. Linköpings universitet, Sweden
Tools for biological data analysis
Similarity of biological data
Similarity between data entries
Hierarchical microarray clustering
(J-Express Pro)
Sequence alignment (BLAST)
Lord PW, Stevens RD, Brass A, Goble CA.
Bioinformatics, 19(10):1275-83, 2003.
Classification of abstracts
- Basic task – computation of a similarity value between objects
Similarity-based grouping
- Not a trivial task
  - data is complex
  - many grouping algorithms available: which algorithm performs best for which grouping task?
  - grouping on which attributes?
  - existing grouping algorithms may not be straightforwardly applicable to new data sets
- Environments that support study, comparison and evaluation of different grouping strategies are needed
Method for similarity-based grouping
[Figure: overview of the method. Steps: Specification of grouping rules, Pairwise grouping, Grouping, Evaluation, Analysis. Inputs: Data source; Grouping attributes; Library of similarity functions (domain-independent and domain-dependent); Other knowledge; Library of classifications.]
Idea
- A toolKit for Evaluating Grouping Algorithms
KitEGA framework
- Input components (plug-ins)
  - grouping procedures to be evaluated
  - data sources
  - evaluation methods
  - classifications
  - other knowledge
- Tool executes algorithms and stores results
- User analyzes results using different views on the result data
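
The plug-in idea can be sketched roughly as follows. This is only an illustration under assumptions: the names `Toolkit`, `register_procedure`, `register_source` and `run` are hypothetical and are not taken from KitEGA. The sketch shows nothing more than a tool that runs every registered grouping procedure on every registered data source and stores the results for later analysis.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a data source is a list of entry ids,
# a grouping procedure maps a data source to a list of groups.
DataSource = List[str]
Grouping = List[List[str]]
Procedure = Callable[[DataSource], Grouping]


class Toolkit:
    """Illustrative plug-in registry; not KitEGA's actual design."""

    def __init__(self) -> None:
        self.procedures: Dict[str, Procedure] = {}
        self.sources: Dict[str, DataSource] = {}
        self.results: Dict[Tuple[str, str], Grouping] = {}

    def register_procedure(self, name: str, proc: Procedure) -> None:
        self.procedures[name] = proc

    def register_source(self, name: str, source: DataSource) -> None:
        self.sources[name] = source

    def run(self) -> None:
        # Execute every procedure on every data source and store the result.
        for p_name, proc in self.procedures.items():
            for s_name, source in self.sources.items():
                self.results[(p_name, s_name)] = proc(source)


def singletons(source: DataSource) -> Grouping:
    """Trivial example procedure: each entry forms its own group."""
    return [[entry] for entry in source]


kit = Toolkit()
kit.register_procedure("singletons", singletons)
kit.register_source("toy", ["p1", "p2"])
kit.run()
print(kit.results[("singletons", "toy")])
```

Keying the stored results by (procedure, data source) is one simple way to let different views compare runs afterwards.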
Illustration
- Grouping task. Grouping of proteins with respect to
  - biological function
  - class of isozymes they belong to
- Data source(s)
  - human proteins involved in glycolysis
  - 190 data entries retrieved via Entrez
Data entry
[Figure: data entry from the Entrez Protein database; labeled attributes: GOann, Sequence]
Data sources
- only terms of the GO function ontology analyzed
- only data entries having GO terms
- DS1: GOann; 67 data entries
- DS2: Keywords (spkw2go), Ec_number (ec2go) and GOann combined into GOcomb; 93 data entries
- DS3: Ec_number (ec2go) and GOann combined into GOcomb; 92 data entries
- GO Consortium. Mappings between data values and ontological terms:
  - ec2go – ec_numbers translated into GO terms
  - spkw2go – SwissProt keywords translated into GO terms

Grouping components
- Library of similarity functions
  - EditDist(v1,v2)
  - SeqSim(v1,v2)
  - SemSim(v1,v2) (→ GO ontology)
- Grouping rules
- Grouping methods
  - Connected components
  - Cliques
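
As an illustration of a domain-independent similarity function, the sketch below computes the Levenshtein edit distance between two attribute values and turns it into a grouping rule through a threshold. The normalization to [0, 1], the helper names `edit_sim` and `similar`, and the threshold value are assumptions made for this example; the slides do not say how EditDist(v1,v2) is defined or thresholded in KitEGA.

```python
def edit_dist(v1: str, v2: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(v2) + 1))
    for i, c1 in enumerate(v1, start=1):
        curr = [i]
        for j, c2 in enumerate(v2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_sim(v1: str, v2: str) -> float:
    """Assumed normalization: 1.0 for identical values, 0.0 at maximal distance."""
    longest = max(len(v1), len(v2))
    return 1.0 if longest == 0 else 1.0 - edit_dist(v1, v2) / longest


def similar(v1: str, v2: str, threshold: float = 0.8) -> bool:
    """Hypothetical grouping rule: values match if similarity reaches the threshold."""
    return edit_sim(v1, v2) >= threshold


print(edit_dist("kitten", "sitting"))   # 3
print(similar("2.7.1.11", "2.7.1.12"))  # True (similarity 0.875)
```

Raising or lowering `threshold` changes which pairs count as similar, which is one way the impact of a threshold on the resulting groups could be studied.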
Evaluation methods
- Types of quality measures
  - internal – based on information obtained during the grouping
  - external – with respect to known classes of the grouped data
- In this illustration: external
  - Purity
  - F-measure
  - Entropy
  - Mutual information

Classifications
- Manual classification according to
  - biological function
  - classes of isozymes
Selection of test case
Specification of grouping rules (DS3)
Pairwise grouping (DS3)
- all pairs of data entries compared
Grouping (DS3)
- ConnectedComponents: data entries in a group are directly or transitively similar to each other
- Cliques: all data entries in a group are similar to each other
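
The difference between the two grouping methods can be sketched on a toy example. The code below is an illustration, not KitEGA's implementation: it compares all pairs of entries, links the similar ones, and then derives groups either as connected components or as maximal cliques (using a basic Bron-Kerbosch enumeration, which is a choice made for this sketch). The entry names and the `links` relation are made up.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set


def pairwise_graph(entries: List[str],
                   similar: Callable[[str, str], bool]) -> Dict[str, Set[str]]:
    """Compare all pairs of entries and link the pairs judged similar."""
    graph: Dict[str, Set[str]] = {e: set() for e in entries}
    for a, b in combinations(entries, 2):
        if similar(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Groups whose members are directly or transitively similar."""
    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        group: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


def maximal_cliques(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Groups in which every pair of members is similar."""
    cliques: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Toy chain a ~ b ~ c, where a and c are not directly similar.
links = {("a", "b"), ("b", "c")}
g = pairwise_graph(["a", "b", "c", "d"],
                   lambda u, v: (u, v) in links or (v, u) in links)
print(connected_components(g))  # one group {a, b, c} plus the singleton {d}
print(maximal_cliques(g))       # {a, b}, {b, c} and {d}
```

Note that maximal cliques may overlap (here `b` appears in two groups), while connected components always partition the entries.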
Evaluation
- Entropy: average distribution of the data entries in each group among the classes
- Purity: average precision of the groups with respect to their best matching classes
- Mutual information: correspondence on average between each group and class
- F-measure: precision and recall of the classes with respect to their best matching groups on average
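
The four external measures can be sketched as below. The exact formulas used by KitEGA are not given in the slides, so this example assumes commonly used definitions: size-weighted averages over groups (purity, entropy) or over classes (F-measure), base-2 logarithms, and unnormalized mutual information. Groups are assumed to be non-empty, and the toy data is made up.

```python
from collections import Counter
from math import log2
from typing import Dict, List

Groups = List[List[str]]


def _overlaps(groups: Groups, cls: Dict[str, str]) -> List[Counter]:
    """Per group, count how many entries fall into each known class."""
    return [Counter(cls[e] for e in g) for g in groups]


def purity(groups: Groups, cls: Dict[str, str]) -> float:
    """Share of entries that belong to the best matching class of their group."""
    n = sum(len(g) for g in groups)
    return sum(max(c.values()) for c in _overlaps(groups, cls)) / n


def entropy(groups: Groups, cls: Dict[str, str]) -> float:
    """Size-weighted average of the class entropy inside each group."""
    n = sum(len(g) for g in groups)
    total = 0.0
    for g, counts in zip(groups, _overlaps(groups, cls)):
        h = -sum((k / len(g)) * log2(k / len(g)) for k in counts.values())
        total += len(g) / n * h
    return total


def f_measure(groups: Groups, cls: Dict[str, str]) -> float:
    """Size-weighted average, over classes, of the best F-score among groups."""
    n = sum(len(g) for g in groups)
    class_sizes = Counter(cls[e] for g in groups for e in g)
    overlaps = _overlaps(groups, cls)
    score = 0.0
    for c, size in class_sizes.items():
        best = 0.0
        for g, counts in zip(groups, overlaps):
            hit = counts[c]
            if hit:
                p, r = hit / len(g), hit / size
                best = max(best, 2 * p * r / (p + r))
        score += size / n * best
    return score


def mutual_information(groups: Groups, cls: Dict[str, str]) -> float:
    """Unnormalized mutual information between group and class membership."""
    n = sum(len(g) for g in groups)
    class_sizes = Counter(cls[e] for g in groups for e in g)
    mi = 0.0
    for g, counts in zip(groups, _overlaps(groups, cls)):
        for c, k in counts.items():
            mi += k / n * log2(k * n / (len(g) * class_sizes[c]))
    return mi


# Toy data: two groups, two known classes X and Y.
groups = [["a", "b", "c"], ["d", "e"]]
classes = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Y"}
for measure in (purity, entropy, f_measure, mutual_information):
    print(measure.__name__, round(measure(groups, classes), 3))
```

Under these definitions, higher purity, F-measure and mutual information indicate better agreement with the classification, while lower entropy is better.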
Analysis
[Figure: analysis view; labels: true positives, false positives, false negatives]
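
One plausible reading of these three labels is a comparison of a single group against a known class. The sketch below makes that assumption explicit; the function name `compare` and the entry identifiers are hypothetical and not taken from the tool.

```python
from typing import Dict, Set


def compare(group: Set[str], known_class: Set[str]) -> Dict[str, Set[str]]:
    """Split entries into the three categories with respect to one class."""
    return {
        "true positives": group & known_class,   # in the group and in the class
        "false positives": group - known_class,  # in the group, not in the class
        "false negatives": known_class - group,  # in the class, not in the group
    }


# Hypothetical entry identifiers, for illustration only.
group = {"p1", "p2", "p3"}
known_class = {"p2", "p3", "p4"}
result = compare(group, known_class)

tp = len(result["true positives"])
precision = tp / len(group)     # share of the group that belongs to the class
recall = tp / len(known_class)  # share of the class captured by the group
print(result)
print(precision, recall)
```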
Analysis - comparison
Test cases. Observations
- Best suited grouping approaches. For data source Glyc-Funct-AnnEc-onlyGO (DS3)
  - SemSim(GOcomb) for grouping on biological function
  - SeqSim(Sequence) for grouping on classes of isozymes
- Suitability of mappings for the used grouping approaches
  - spkw2go – too general, e.g. 'Glycolysis'
  - ec2go – specific enough, e.g. '6-phosphofructokinase activity'
- Comparisons: use of different data sources, grouping algorithms, and classifications, grouping on different attributes, impact of threshold
Summary and future work
- Motivated the need for environments that support the evaluation and comparison of similarity-based grouping procedures
- Implemented the KitEGA tool based on a method for evaluating similarity-based grouping algorithms
- Illustrated KitEGA using test cases based on different strategies and classifications
- Future work: extend the KitEGA implementation