A tool for evaluating strategies for grouping of biological data
Vaida Jakonienė, Patrick Lambrix
Linköpings universitet, Sweden

Outline
- Motivation
- Method for similarity based grouping
- KitEGA – illustration
- Summary and future work

Similarity of biological data
- Tools for biological data analysis
- Similarity between data entries
- Hierarchical microarray clustering (J-Express Pro)
- Sequence alignment (BLAST)
- Lord PW, Stevens RD, Brass A, Goble CA. Bioinformatics, 19(10):1275-83, 2003.
- Classification of abstracts
- Basic task – computation of a similarity value between objects

Similarity-based grouping
Not a trivial task:
- data is complex
- many grouping algorithms available: which algorithm performs best for which grouping task?
- grouping on which attributes?
- existing grouping algorithms may not be straightforwardly applicable to new data sets

Environments that support study, comparison and evaluation of different grouping strategies are needed

Method for similarity-based grouping
[Figure: overview of the method]
- Steps: Specification of grouping rules, Pairwise grouping, Grouping, Evaluation, Analysis
- Inputs: Data source; Grouping attributes; Library of similarity functions (domain-independent and domain-dependent); Other knowledge; Library of classifications

Idea
A toolKit for Evaluating Grouping Algorithms
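
The pairwise grouping and grouping steps of the method can be sketched roughly as below. This is only an illustrative sketch, not KitEGA code: the set-overlap similarity function `jaccard`, the threshold of 0.5, the sample entries, and all function names are assumptions made for the example. The grouping step shown here puts entries in the same group when they are directly or transitively similar.

```python
from itertools import combinations


def jaccard(a, b):
    """Hypothetical domain-independent similarity: overlap of two term sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_grouping(entries, sim, threshold):
    """Compare all pairs of entries; keep pairs whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(sorted(entries), 2)
        if sim(entries[i], entries[j]) >= threshold
    ]


def connected_components(ids, similar_pairs):
    """Group entries that are directly or transitively similar (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for i, j in similar_pairs:
        parent[find(i)] = find(j)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Made-up entries (identifier -> set of annotation terms); not real data.
entries = {
    "e1": {"kinase", "glycolysis"},
    "e2": {"kinase", "glycolysis", "atp"},
    "e3": {"kinase", "atp"},
    "e4": {"isomerase"},
}

pairs = pairwise_grouping(entries, jaccard, threshold=0.5)
groups = connected_components(entries, pairs)
```

In this made-up example, e1 and e3 end up in the same group even though they are not directly similar at the chosen threshold, because both are similar to e2; e4 forms a group of its own.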
KitEGA framework
- Input components (plug-ins):
  - grouping procedures to be evaluated
  - data sources
  - evaluation methods
  - classifications
  - other knowledge
- Tool executes algorithms and stores results
- User analyzes results using different views on the result data

Illustration
- Grouping task: grouping of proteins with respect to
  - biological function
  - class of isozymes they belong to
- Data source(s): human proteins involved in glycolysis, retrieved via Entrez (Protein database); 190 data entries

Data entry
[Figure: example data entry, with its GOann and Sequence attributes highlighted]

Data sources
- only terms of GO function ontology analyzed
- only data entries having GO terms
- DS1: GOann (→ GO ontology), 67 data entries
- DS2: GOann, Keywords (spkw2go), Ec_number (ec2go) → GOcomb, 93 data entries
- DS3: GOann, Ec_number (ec2go) → GOcomb, 92 data entries
- GO Consortium. Mappings between data values and ontological terms:
  - ec2go – ec_numbers translated into GO terms
  - spkw2go – swissprot keywords translated into GO terms

Grouping components
- Library of similarity functions: EditDist(v1,v2), SeqSim(v1,v2), SemSim(v1,v2)
- Grouping rules
- Grouping methods: Connected components, Cliques
- GO ontology

Evaluation methods
- Types of quality measures:
  - internal – based on information obtained during the grouping
  - external – with respect to known classes of the grouped data
- In this illustration: external (Purity, F-measure, Entropy, Mutual information)

Classifications
- Manual classification according to
  - biological function
  - classes of isozymes
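
Two of the external quality measures named above, purity and entropy, can be computed from a grouping and a reference classification as in the following sketch. It uses one common textbook formulation (size-weighted averages over non-empty groups, entropy in bits); the exact definitions used in KitEGA are not given in these slides, so the formulas, function names, and sample data here are assumptions for illustration only.

```python
from collections import Counter
from math import log2


def purity(groups, true_class):
    """Fraction of entries that belong to the majority class of their group."""
    total = sum(len(g) for g in groups)
    majority = sum(max(Counter(true_class[e] for e in g).values()) for g in groups)
    return majority / total


def entropy(groups, true_class):
    """Size-weighted average of the class entropy (in bits) within each group."""
    total = sum(len(g) for g in groups)
    result = 0.0
    for g in groups:
        counts = Counter(true_class[e] for e in g)
        h = -sum((n / len(g)) * log2(n / len(g)) for n in counts.values())
        result += (len(g) / total) * h
    return result


# Made-up grouping result and reference classification; not real data.
# Groups are assumed to be non-empty.
groups = [{"e1", "e2", "e3", "e4"}, {"e5", "e6"}]
true_class = {"e1": "A", "e2": "A", "e3": "A", "e4": "B", "e5": "B", "e6": "B"}

p = purity(groups, true_class)
h = entropy(groups, true_class)
```

In this example five of the six entries fall in the majority class of their group, so the purity is 5/6. Under this formulation, higher purity and lower entropy both indicate groups that match the known classes more closely.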
Selection of test case (DS3)

Specification of grouping rules (DS3)

Pairwise grouping
- all pairs of data entries compared

Grouping (DS3)
- ConnectedComponents: data entries in a group are directly or transitively similar to each other
- Cliques: all data entries in a group are similar to each other

Evaluation
- Entropy: average distribution of the data entries in each group among the classes
- Purity: average precision of the groups with respect to their best matching classes
- Mutual information: correspondence on average between each group and class
- F-measure: precision and recall of the classes with respect to their best matching groups on average

Analysis
- true positives
- false positives
- false negatives

Analysis - comparison

Test cases. Observations
Best suited grouping approaches.
For data source Glyc-Funct-AnnEc-onlyGO (DS3):
- SemSim(GOcomb) for grouping on biological function
- SeqSim(Sequence) for grouping on classes of isozymes

Suitability of mappings for the used grouping approaches
- spkw2go – too general, e.g. 'Glycolysis'
- ec2go – specific enough, e.g. '6-phosphofructokinase activity'

Comparisons: use of different data sources, grouping algorithms, and classifications; grouping on different attributes; impact of threshold

Summary and future work
- Motivated the need for environments that support the evaluation and comparison of similarity-based grouping procedures
- Implemented the KitEGA tool based on a method for evaluating similarity-based grouping algorithms
- Illustrated KitEGA using test cases based on different strategies and classifications
- Future work: extend the KitEGA implementation