Collaborative Information Management: Advanced Information Processing in Bioinformatics

Joost N. Kok
LIACS - Leiden Institute of Advanced Computer Science & LUMC - Leiden University Medical Center

BioRange
• Bioinformatics for microarray technology
• Bioinformatics for proteomics and metabolomics
• Integrative bioinformatics
• VL-e informatics for bioinformatics applications
• Test bed with “real-life applications”

Biorange CIM, AIM in BioINF
Five research lines:
• Information Structuring
• Heterogeneous Data Integration
• Advanced Mining Algorithms
• Data Interlinking and Integration
• Data Storage and Management

1: Advanced Mining Algorithms

Data Mining
• Data Mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
  useful; novel, surprising; comprehensible; valid (accurate)

Data Mining
• It is somewhat comparable to statistics (and is often based on it), but goes further: whereas statistics aims mainly at validating given hypotheses, data mining often generates and tests millions of potential patterns in the hope of finding some that are potentially useful.

Intelligent Interfaces

Case study: SNP data
• Genome scan comprising 500K data points (Single Nucleotide Polymorphisms, or SNPs) in 900 subjects from families expressing survival to extremely high ages (longevity).
• The analysis of this set of 450 million data points aims to recognize patterns specific to the genetic make-up of long survivors.

Case study: SNP data
• The genetic scan data will be combined with
  • gene expression data (30,000 data points per subject in 100 subjects),
  • protein data (NMR spectra from blood parameters in hundreds of subjects) and
  • imaging data (quantitative photography of facial ageing parameters).
Case study: SNP data
• Subjects with SNPs
• Classes (Young, Old)
• Above a certain support within Y,O
• Above a certain difference between classes Y,O
• Above a certain correlation with a class Y,O
• etc.

Substructures
• Sequences
  • DNA
• Trees
  • XML documents
• Graphs
  • Molecules

GASTON Tools
hms.liacs.nl

Mutagenicity data set of 4069 compounds (56% mutagenic)
www.cheminformatics.org

To boldly go where no chemist has gone before
08 February 2006

Studying the interactions between different molecular fragments is taking researchers to the uncharted regions of chemical space.

Chemical space, consisting of all possible stable molecules, is mind-bogglingly vast. Theoretical chemists have calculated that there are more possible molecules based on hexane (10^29) than there are stars in the visible universe. Chemists have only made fairly tentative journeys into this space, with the largest chemical databases currently containing up to 25 million different molecules.

Ad IJzerman from Leiden University, the Netherlands, and colleagues realised that analysing these chemical databases could reveal which regions of chemical space have been extensively explored and which remain relatively uncharted.

IJzerman’s team split the 250 000 molecular structures contained in the US National Cancer Institute’s database into component fragments, consisting of rings, substituents and several types of linkers. This generated 65 000 different fragments, of which the vast majority (70 per cent) occurred only once.

The chemists selected the 1730 fragments that occurred in more than 20 different molecules and calculated the number of times that each possible pair of fragments occurred in the same molecule. Some pairs of fragments were commonly found together, forming what the researchers termed ‘chemical clichés’, but others were rarely found in the same molecule.
By generating molecules containing the fragments that aren’t often brought together, predict the researchers, chemists should be able to open up new areas of chemical space and potentially discover new molecules with interesting properties.

IJzerman has already demonstrated the benefits of this fragment analysis to a medicinal chemist. She was having problems with a particular compound and he suggested possible alternative ring systems, based on his list of the most popular ring fragments. ‘It turned out that one of our top 40 ring systems was actually her intended modification, reached after much deliberation,’ he told Chemistry World.

2: Data Storage and Management

Patternbases
• Pattern Databases = Patterns + Data
• Query Languages work on Patterns + Data
• Since patternbases provide an architecture for pattern discovery and a means to discover and use those patterns through the query language, data mining becomes in essence an interactive querying process.

Patternbases
• Derive new patterns from data + old patterns
• Apriori Algorithm: Frequent Item Sets
• Frequent Item Sets + Data: Association Rules

Patternbases
• Derive new patterns from data + old patterns
• Find all item sets that are correlated with classes
• Fix a
• We can prune the search space by only considering frequent item sets with minimum support

Patternbases

Research Lines Biorange
Five research lines:
• Information Structuring
• Heterogeneous Data Integration
• Advanced Mining Algorithms
• Data Interlinking and Integration
• Data Storage and Management
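
Illustration for the Patternbases slides: the chain "data, then frequent item sets (Apriori), then association rules from frequent item sets + data" might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the project: the toy transactions, the item names (snpA, snpB, snpC), the function names apriori and association_rules, and the thresholds min_support and min_confidence are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: support count} for every frequent item set.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Old patterns -> new candidates: join frequent (k-1)-sets and keep
        # a k-set only if all of its (k-1)-subsets are frequent (pruning).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Candidates + data -> support counts.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Derive rules (lhs, rhs, confidence) from frequent item sets."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Every subset of a frequent set is itself frequent,
                # so its support count is already in `frequent`.
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Hypothetical toy data: each transaction lists the markers seen in a subject.
transactions = [
    {"snpA", "snpB", "snpC"},
    {"snpA", "snpB"},
    {"snpA", "snpC"},
    {"snpB", "snpC"},
    {"snpA", "snpB", "snpC"},
]
frequent = apriori(transactions, min_support=3)
rules = association_rules(frequent, min_confidence=0.7)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

The pruning step mirrors the last Patternbases bullet: only item sets whose subsets already meet the minimum support are counted against the data, so the search space shrinks at each level. Restricting the counts to one class (for example Young or Old in the SNP case study) would be one way to compare support between classes, but that extension is not shown here.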