Knowledge Discovery in Biological Databases

David Gilbert and Aik Choon Tan
{drg,actan}@brc.dcs.gla.ac.uk
www.brc.dcs.gla.ac.uk
Bioinformatics Research Centre, University of Glasgow

Outline
• Introduction
• Motivation
• Overview of KDBD
• Machine Learning
  – Domain representations
  – Knowledge representations
  – Search strategies
  – Classification methods
  – Evaluation & Interpretation
• Conclusion

Bioinformatics
• Bio - Molecular Biology
• Informatics - Computer Science
• Bioinformatics - the study of the application of molecular biology, computer science, artificial intelligence, statistics and mathematics
  – to model, organise, understand and discover interesting knowledge associated with the large-scale molecular biology databases,
  – to guide assays for biological experiments.
• “Computational Biology” (USA)

Bioinformatics = Machine Learning + Data Mining + Biological Databases
  = (?Knowledge Discovery in Databases?)
  = (?Knowledge Discovery in (Biological) Databases?)

Growth in Sequence Data
[Figure: growth in sequence data]

Growth in Structural Data
[Figure: growth in structural data (Berman et al 2002)]

Increasing complexity of Biological Data
• Organisms
• Physiology
• Organs
• Tissues
• Cell signalling
• Cell
• Protein-protein interaction (pathways)
• Protein functions
• Protein structures
• Gene expressions
• Nucleotide structures
• Nucleotide sequences

Computational bottlenecks
Caused by
• Data characteristics
  – lots of it
  – heterogeneous
  – distributed
  – incomplete
  – dirty
• (Traditional) complexity issues: time, space
• Induction: constructing discriminatory/descriptive functions from large data sets

Computational bottlenecks
• Data representation
  – sequences (DNA, RNA, amino-acid)
  – trees (phylogenetic, …)
  – graphs (protein structure, biochemical networks)
  – matrices (micro-arrays, metabolic pathways)

Molecular biology overview
Biological activity: interaction!

Knowledge Discovery

Biology – A Classification Problem
• Biology - The division of physical science which deals with organised beings or animals and plants, their morphology, physiology, origin, and distribution (OED)
• Analysis via classification - steps:
  – Organise examples into families
  – Find common descriptions to characterise the family members
  – Look for more members in the Universe
  – If a new instance matches the characteristics of a family, infer family properties to the new instance and add it as a member

Comparative genomics
One aspect: making inferences (Eisenberg et al, 2000)

Data Explosion
[Figure: Genome and Proteome, with sequences S1 to S12]

ggggctacgg ggggtggggc ttcgcgcccc gccggcctat aaaagcggcc gccgcggctc
cgtgccgttg ccgaccttcg cctgcgccgc tgctgcttcgcgcccgtcgc ctccgccatg
gctcccagga agttcttcgt gggtggcaac tggaagatga acggcgacaa gaagagcttg
ggcgagctca tccacacgct gaatggcgcc aagctctcgg ccgacaccga ggtggtttgc
ggagcccctt caatctacct tgattttgcc cgccagaagc ttgatgcaaa gattggagtt
gcagcacaaa actgttacaa ggtaccgaag ggtgctttca caggagagat cagcccagca
atgatcaaag atattggagc tgcatgggtg atcctgggcc actcagagcg gaggcatgtttttggagagt
ctgatgagtt gattgggcag aaggtggctc atgctcttgc tgaaggcctc ggtgtcatcg
cctgcattgg ggagaagctg gatgagagag aagctggcat aacggagaag gtggtctttg
aacagaccaa agctattgct gataacgtga aggactggag taaggtggtt cttgcctatg
agccagtttg ggctatcgga actggtaaaa ctgctactcc ccaacaggct caggaggttc
atgagaagct gagaggctgg ctcaaaagcc acgtgtctga tgctgttgct cagtcaacta
ggacgtcta tggaggttca gtcactggtg gcaactgtaa ggaactggcc tcccagcatg
atgtggatgg cttccttgtt ggtgggacgt ctctcaagcc agagtttgtg gatattatca
atgcaaaaca ttaaagcagc ctgtgaggag cagtccctta cggttaagag caagaaactg
aagcaagaag ggaccttgtg ttgcacgtct ctcggtacag aggcttcttc tgaggctttc
ccccaccacc acaattattg ttctagctgt gctgctaacc cccaccacct tgttggagtc
ccattagtgt gagcccatct cagcagagtc tcctttctga actggcaaaatccttggtta tctgttgagc
acgt

Data, information, knowledge …
• data : nucleotide sequence
• information : where are the “genes”.
  [Figure: gene structure labelled with control statement, TATA box, start, gene, termination (stop)]
  – found using a classifier, pattern or rule which has been mined/discovered
• knowledge : facts and rules
  If a gene X has a weak psi-blast assignment to a function F
  – and that gene is in an expression cluster
  – and sufficient members of that cluster are known to have function F,
  ⇒ then believe the assignment of F to X.

Data, Information, Pattern, Knowledge

DATA
APRKFFVGGN WKMNGKRKSL GELIHTLDGA KLSADTEVVC GAPSIYLDFARQKLDAKIGV
AAQNCYKVPK GAFTGEISPA MIKDIGAAWV ILGHSERRHVFGESDELIGQ
KVAHALAEGL GVIACIGEKLDEREAGITEKVVFQETKAIADNVK
DWSKVVLAYEPVWAIGTGKTATPQQAQEVHE
KLRGWLKTHVSDAVAVQSRIIYGGSVTGGNCK
ELA SQHDVDGFLV GGASLKPEFV DIINAKH

INFORMATION
Molecular Weight = 26528
Number of Residues = 247
Number of Alpha = 11
Number of Beta = 8
Content of Alpha = 43.32
Content of Beta = 17.00

PATTERN
[AV]-Y-E-P-[LIVM]-W-[SA]-I-G-T-[GK]

KNOWLEDGE
The DNA sequence encodes an alpha-beta protein with a barrel architecture. The structure of the protein is a TIM-barrel.

An abstract view
• Given {p:9, p:1, q:8, p:3, q:2, q:6, p:5, q:4, p:7, q:0}
• Cluster: {p:9, p:1, p:3, p:5, p:7} {q:8, q:2, q:6, q:4, q:0}
• Background knowledge: > + -
• Induce:
  – 0 is q
  – X is q if X-2 is q and X > 0
  – X is p if not(X is q)

What is a pattern?

Types of Pattern
• Deterministic
  – a boolean function which either matches a given object (e.g. a sequence or structure) or does not
  – e.g. the regular expression R-x-Y-[ST] for a sequence pattern
• Probabilistic
  – assigns to each sequence the probability that it was generated by the model; the higher the probability, the better the match between the sequence and the pattern
  – e.g. a profile for a sequence pattern

Example alignment:
      1  2  3  4  5  6  7  8  9  10
S1:   R  V  Q  R  A  Y  S  Y  V  N
S2:   P  L  M  R  A  Y  S  I  A  S
S3:   L  V  I  R  P  Y  T  P  V  S
S4:   L  C  M  R  A  Y  T  P  T  S
S5:   E  K  L  R  L  Y  S  I  A  S

Profile (residue frequencies per position, computed from the alignment):
 1: R=.2 P=.2 L=.4 E=.2
 2: V=.4 L=.2 C=.2 K=.2
 3: Q=.2 M=.4 I=.2 L=.2
 4: R=1
 5: A=.6 P=.2 L=.2
 6: Y=1
 7: S=.6 T=.4
 8: Y=.2 I=.4 P=.4
 9: V=.4 A=.4 T=.2
10: N=.2 S=.8

Motifs
Motif : a pattern associated with some biological meaning (e.g. function)

FAD binding site
1FDR:_   RVQRAYSYVNSP
1A8P:_   PLMRAYSIASPN
1NDH:_   LVIRPYTPVSSD
1CNF:_   LCMRAYTPTSMV
1B2R:A   EKLRLYSIASTR
1AMO:A   LQARYYSIASSS

Sequence pattern: RxY[ST]
[Figure: structural pattern of the FAD binding site with the FAD ligand]

KDD in BIOINFORMATICS
[Figure: the KDD process: Raw Data → Selection → Target Data → PreProcessing → PreProcessed Data → Transformation → Transformed Data → Machine Learning → Pattern → Interpretation/Evaluation → KNOWLEDGE!!]

Target data:        Transformed data:
S1:ACAATG           S1:ACA---ATG
S2:TCAACTATC        S2:TCAACTATC
S3:ACACAGC          S3:ACAC--AGC
S4:AGAATC           S4:AGA---ATC
S5:ACCGATC          S5:ACCG--ATC

Characteristics of KDD
• Validity
• High-level patterns/languages understandable by humans
• Accuracy - measures of certainty (probability)
• Interesting results - novel, useful and nontrivial to compute
• Efficiency - running times for large databases are predictable and acceptable
(Frawley et al. 1992)

Data preparation
• Select and identify target database
• Extract target data set
• Transform the target data set into the input format of the learning algorithm
• Divide the target data set into groups (training, test sets)
• Takes most of the KDD process time
• Issues:
  – Dealing with noisy data and missing attributes
  – Filtering the target data set (e.g. statistical analysis for gene expression before performing clustering)

Machine learning tasks (in bioinformatics as elsewhere…)
• Classification: predicting the class of an item
• Clustering: finding groups of items
• Characterisation: describing a group
• Deviation Detection: finding changes
• Linkage Analysis: finding relationships & associations
• Visualisation: presenting data visually to facilitate knowledge discovery by humans (human in the loop)

Learning Approaches
• Unsupervised approach
  – given the unassigned examples, group together the examples with similar properties
• Supervised approach
  – given a set of positive and negative examples with predefined classes, construct classifiers that distinguish between the classes

Issues
• Domain representation
• Knowledge representation
• Search strategy
• Classification method

Learning in bioinformatics context
• Automatically find patterns (given a training set)
• Characterisation: (positive examples only) patterns describing “interesting” properties of a family
• Classification: (positive and negative examples) patterns distinguishing S+ and S-, which may overlap
• Formal language for descriptions (domain representation)
• Scoring function to rate descriptions (knowledge representation)
• Algorithm (search strategy and classification methods)

Protein comparison & motif discovery
[Figure: workflow linking a structure database and a structure motif database; features are extracted into structure descriptions, which are matched against patterns or used to discover/compare patterns; the outputs feed structure comparison, structure prediction, function prediction and structure classification]
Eidhammer, Jonassen & Taylor, “Structure Comparison and Structure Patterns”, JCB, 7:5 pp 685-716, 2000.

Steps
• Pattern matching: input is one pattern and one structure; output is “yes”/“no” (deterministic pattern) or a score (probabilistic pattern).
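
The two kinds of pattern matching can be illustrated with a small sketch. It reuses the RxY[ST] sequence pattern and the ten-column example alignment S1 to S5 from the slides on pattern types. The function names are illustrative assumptions, and the profile score is a plain product of column frequencies without pseudocounts, which is a simplification rather than what any particular profile tool does.

```python
import re
from collections import Counter

# Example alignment S1-S5 from the "Types of Pattern" slide.
ALIGNED = ["RVQRAYSYVN", "PLMRAYSIAS", "LVIRPYTPVS", "LCMRAYTPTS", "EKLRLYSIAS"]


def matches_pattern(seq, pattern="R.Y[ST]"):
    """Deterministic pattern: True if the pattern occurs anywhere in seq.

    The default is R-x-Y-[ST] written as a Python regular expression.
    """
    return re.search(pattern, seq) is not None


def build_profile(aligned):
    """Return one residue-frequency table per alignment column."""
    n = len(aligned)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*aligned)
    ]


def profile_score(seq, profile):
    """Probabilistic pattern: product of the column frequencies of seq.

    Residues never seen in a column get frequency 0 (no pseudocounts),
    so a single unseen residue makes the whole score 0.
    """
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    score = 1.0
    for residue, column in zip(seq, profile):
        score *= column.get(residue, 0.0)
    return score


if __name__ == "__main__":
    profile = build_profile(ALIGNED)
    print(matches_pattern("RVQRAYSYVNSP"))       # yes/no answer
    print(profile_score("PLMRAYSIAS", profile))  # score for S2
```

The deterministic matcher returns only yes or no, whereas the profile ranks sequences, which is why a probabilistic pattern can still separate near misses from clear non-members.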
• Pattern discovery: find patterns matching some/all of the input structures (choose patterns with as high a fitness value as possible with respect to the input structures)
• Comparison: input is (a pair of) structure descriptions; find (local/global) similarities, optimise a similarity measure, output a score.
  – Similarity may be represented as a pattern

Pattern discovery in biosequences
• Group together sequences thought to have common biological (structural, functional) properties
  – families (biological - semantic level)
• Study their common syntactic properties, ignoring biological (semantic) properties
  – patterns, clusters (mathematical - syntactic level)
• Test whether the discovered patterns make sense (back to semantic level)

Approaches to pattern discovery
• Pattern driven: enumerate all (or some) patterns up to a certain complexity (length), calculate the score for each, and report the best
• Sequence driven: look for patterns by aligning the given sequences
Brazma et al, Approaches to the automatic discovery of patterns in biosequences, Journal of Computational Biology, 5(2):277-303, 1998

Pattern driven algorithms
• Brute force - enumerate all patterns (for instance, all substrings) up to a given length (complexity)
• Evaluate their fitness with respect to the input sequences and output the best
• Unrealistic for patterns of even modest size, even for substring patterns (e.g., for substring patterns of length 10 over the amino acid alphabet, there are more than 10^13 different substrings to enumerate in this way)
• E.g. PRATT program (Jonassen, U.Bergen, via www.ebi.ac.uk)

Sequence driven algorithms
• Group similar sequences together (e.g., in pairs);
• For each group find a common pattern (e.g., by dynamic programming);
• Group similar patterns together and repeat the previous step until there is only one group left

Sequence driven approach
[Figure: tree in which sequences s1 to s5 are combined step by step into patterns p1 to p4]

Characteristic string function for family F+
function g : Σ* → {FALSE, TRUE}
g(s) = TRUE if s ∈ F+
       FALSE if s ∈ F-
[Figure: Σ* divided into F+ and F-]

Classification & characterisation Problems
• Classification: + and - examples
• Characterisation: + examples
[Figure: training sets S+ and S- shown inside F+ and F- within Σ*, for clean training data and for noisy training data]

(Some) Performance Measurements
• Sensitivity: $Sn = \frac{TP}{TP + FN}$, with $0 \le Sn \le 1$
• Specificity: $Sp = \frac{TN}{TN + FP}$, with $0 \le Sp \le 1$
• Positive Predictive Value: $PPV = \frac{TP}{TP + FP}$, with $0 \le PPV \le 1$
• Correlation Coefficient: $cc = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(FP + TN)(TN + FN)(FN + TP)}}$, with $-1 \le cc \le 1$
  – cc = 1.0: no FP or FN
  – cc = 0.0: when f is random with respect to S+ and S-
  – cc = -1.0: only FP and FN

Knowledge Representation
“If the predictive accuracies of two hypotheses are statistically equivalent then the hypothesis with better explanatory power will be preferred. Otherwise the one with higher accuracy will be preferred.” (Muggleton et al., 1998)
[Figure: Input (biological sequences: nucleotide sequences, protein sequences) → Learner → Classifier (high accuracy, high explanatory power)]

Domain representation – Example 1
[Figure: schematic of a sequence loop in which two C and two H residues bind Zn]
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

Edit distance
• Levenshtein 1966
• Minimum number of edit operations to transform one string into another
  – insert, delete, substitute (1 symbol)
• Score is zero (identical) or positive
• E.g. “AIMS” & “AMOS”:

  AIMS   ⇒   AIM-S      AIMS
  AMOS       A-MOS      AMOS     (score = 2 for each solution)

The possibilities?

  AIM-S      AIMS
  | | |      |  |
  A-MOS      AMOS

Which is better?

Multiple alignments
• Analyse gene families
  – reveal (subtle) conserved family characteristics

[Table: sequences (rows) against characters (columns 1 to 10)]
S1          Y  D  G  G  I  L  E  A  L
S2          Y  D  G  G  I  V  E  A  L
S3          F  E  G  G  A  V  E  A  L
S4          F  D  G  G  V  V  Q  A  V
S5          Y  E  G  A  L  V  Q  A  L
consensus   y d G G AI VL V e A l

Multiple alignment - methods
• Simultaneous: N-wise alignment (adapted from pairwise approach)
  – uses an N-dimensional matrix.
  – Complexity is
    • O(m1 m2) [2 sequences of length m1 & m2]
    • O(m^n) [n sequences of length m]
  – Thus only good for short sequences.
• Manual (!)
• Progressive (heuristic) e.g. ClustalW:
  – compute pairwise sequence identities
  – construct binary tree (can output phylogenetic tree)
  – align similar sequences in pairs, add distantly related ones later.
[Figure: guide tree in which sequences s1 to s5 are merged through alignments a1 to a4]

Multiple sequence alignment (globins)
CLUSTAL W (1.81) multiple sequence alignment

Human   VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60
Gorilla VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60
Rabbit  VHLSSEEKSAVTALWGKVNVEEVGGEALGRLLVVYPWTQRFFESFGDLSSANAVMNNPKV 60
Pig     VHLSAEEKEAVLGLWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSNADAVMGNPKV 60
        ***:.***.** .*******:****************************..:***.****

Human   KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 120
Gorilla KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120
Rabbit  KAHGKKVLAAFSEGLSHLDNLKGTFAKLSELHCDKLHVDPENFRLLGNVLVIVLSHHFGK 120
Pig     KAHGKKVLQSFSDGLKHLDNLKGTFAKLSELHCDQLHVDPENFRLLGNVIVVVLARRLGH 120
        ******** :**:** **********.*******:********:*****:* **::::*:

Human   EFTPPVQAAYQKVVAGVANALAHKYH 146
Gorilla EFTPPVQAAYQKVVAGVANALAHKYH 146
Rabbit  EFTPQVQAAYQKVVAGVANALAHKYH 146
Pig     DFNPNVQAAFQKVVAGVANALAHKYH 146
        :*.* ****:****************

Sequence alignments & phylogenetic trees

Pair             Score
Human-Gorilla    99
Human-Rabbit     90
Gorilla-Rabbit   89
Human-Pig        84
Gorilla-Pig      84
Rabbit-Pig       83

((Human:0.00000, Gorilla:0.00685) :0.04110, Rabbit:0.05479, Pig:0.10959);

What can we do with multiple alignments?
• Create (databases of) profiles derived from multiple alignments for protein families
  – profile = multiple alignment + observed character frequencies at each position
• Search with a sequence against a database of profiles (e.g. PROSITE database)
  – faster than sequence against sequence
  – gives a more general result (“the input sequence matches globin profile”)
• Search with a profile against a database of sequences
  – PSI-BLAST : can identify more distant relationships than a normal BLAST search

PSI-BLAST (position specific iterated BLAST)
[Figure: Single protein sequence → Search database (BLAST) → Multiple alignment → Profile → search again, iterating until convergence; estimate statistical significance of local alignments]

Protein structure

Protein structure - levels
• PRIMARY STRUCTURE (amino acid sequence)
  VHLTPEEKSAVTALWGKVNVD
  EVGGEALGRLLVVYPWTQRFF
  ESFGDLSTPDAVMGNPKVKAH
  GKKVLGAFSDGLAHLDNLKGTF
  ATLSELHCDKLHVDPENFRLLG
  NVLVCVLAHHFGKEFTPPVQAA
  YQKVVAGVANALAHKYH
• SECONDARY STRUCTURE (helices, strands)
• TERTIARY STRUCTURE (fold)
• QUATERNARY STRUCTURE

TOPS
Simplified descriptions of protein 3D structures and their use in searching and structural pattern recognition

Domain representation – Example 2 (TOPS Approach)

TOPS Example – 2bopA0
[Figure: TOPS diagram of 2bopA0; key: α-helix, β-strand, loop, H-bond, chirality]

What is a pattern?
• Several examples, with common parts highlighted
• A common description
[Figure: Plait motif with the number of insert SSEs, (0,N), marked between its elements; correspondences 1,2,4,6,7,8 and 1,3,4,6,7,8]

Pattern matching
[Figure: the Plait motif, with (0,N) inserts, matched against 2bopA0; alternative matches 1,2,4,6,7,8 and 1,3,4,6,7,8]

Discovering common patterns and making multiple alignments
[Figure: pattern P embedded in Domain 1, Domain 2 and Domain 3]
Compression: send the pattern once, and then for each domain, send the uncovered parts

Topological description
• Consider the sequence of SSEs (strands, helices), plus spatial adjacency within the fold & approximate orientation
• Neglect details (lengths & structures of loops, exact lengths & spatial orientations of SSEs, sequence information...)
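
As a rough sketch of matching a pattern with (0,N) inserts against a topological description, the code below treats both the pattern and the domain as strings of SSE symbols and lists every order-preserving correspondence. Everything here is an assumption made for illustration: the function name, the encoding (E for strand, H for helix), the string `EHEEHE` standing in for a plait-like pattern, and the eight-element domain string, which is a made-up toy rather than the real 2bopA0 description. H-bond and chirality constraints, which TOPS patterns also carry, are ignored.

```python
def find_correspondences(pattern, domain):
    """List all ways to embed pattern in domain, allowing inserts.

    pattern and domain are strings of SSE symbols (assumed encoding:
    'E' = strand, 'H' = helix). Any number of domain elements may be
    skipped between matched elements, i.e. every gap is (0,N).
    Returns tuples of 1-based domain positions, one per embedding.
    """
    results = []

    def extend(p_idx, d_start, chosen):
        # All pattern elements placed: record this correspondence.
        if p_idx == len(pattern):
            results.append(tuple(chosen))
            return
        # Try every later domain element of the required type.
        for d_idx in range(d_start, len(domain)):
            if domain[d_idx] == pattern[p_idx]:
                extend(p_idx + 1, d_idx + 1, chosen + [d_idx + 1])

    extend(0, 0, [])
    return results


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the TOPS database.
    plait_like_pattern = "EHEEHE"
    toy_domain = "EHHEHEHE"
    for match in find_correspondences(plait_like_pattern, toy_domain):
        print(match)
```

With these toy strings the pattern embeds in more than one way, which mirrors the idea of alternative matches. A realistic matcher would also check the H-bond and chirality edges before accepting a correspondence.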
√ simplicity
  – implement very fast comparison algorithms, machine learning, ...
  – detect distant structural relationships
X simplicity
  – relate structures topologically which may have no meaningful biological relationship.

Enhanced TOPS
[Figure: TOPS + sequence information + biochemical features → TOPS+; pattern-discovery/matching algorithms & scoring functions; structure comparison algorithm; TOPS+ based PSSM/HMM profiles; structural and functional assignment (Veeramalai, 2002)]

TOPS + Sequence with Biochemical features
[Figure: functional information on a TOPS diagram: DNA, ligand, DNA binding-site, ligand binding-site (Veeramalai, 2002)]

Feature Extraction
[Figure: panels A, B, C, E (Veeramalai, 2002)]

PSSM/HMM Profiles & Scoring Function
Structure-Based Sequence & Function Extraction For Protein Domains
Key:
  – Ligand interaction
  – Ligand interaction in loop
  – Ligand interacting aa’s
  – Seq segment of a helix
  – Seq segment of a strand
  – Seq segment of a loop
[Figure: domains S1, S2, etc., Sn → TOPS-based multiple sequence alignment → profile generation (SAM/HMMER: HMM profile; IMPALA: PSSM profile) → scoring function (Veeramalai, 2002)]

Methyltransferase Superfamily
[Figure: TOPS diagrams of 1vpt00, 2admA1, 1vid00, 1xvaA2 and 1hmy01, with the conserved structural pattern and the structural/functional (TOPS) pattern (Veeramalai, 2002)]
Key to TOPS:
  – Ligand binding site in α-helix residues
  – Ligand binding site in β-strand residues
  – Ligand binding site in loop-region residues
  – Ion-binding site between SSEs & loop regions

Comparing structures - NADP binding domains
[Figure: structures labelled dihydropteridine reductase and dihydrofolate reductase, from Homo sapiens, rat and E. coli]

Dendrogram from pairwise comparisons & hierarchical clustering
• Dihydropteridine reductase (human)
• Dihydropteridine reductase (rat)
• Lactate dehydrogenase (pig)
• Lactate dehydrogenase (bacterial)
• Malate dehydrogenase (pig)
• Malate dehydrogenase (bacterial)
• Quinone oxido-reductase (bacteria)
• Alcohol dehydrogenase (human)
• D-3-phosphoglycerate dehydrogenase (bacteria)
• NADH peroxidase (bacteria)
• D-glyceraldehyde-3-phosphate dehydrogenase
• Dihydrofolate reductase (bacterial)
• Dihydrofolate reductase (human)
• NADH peroxidase (bacteria)

NAD comparisons
[Figure: comparisons by sequence, by structure (atomic coordinates) and by structure (topology)]

Hierarchical Machine Learning
• Integrate various machine learning techniques
• Incorporate patterns induced from different sources
• Produce user-readable hypotheses

Gene Expression

Gene - informatics??
[Figure: information linked to a gene X: sequence, sequence homologs in other genomes, phylogenetic inferences, genome location, structure, raw data, electron density, structure annotation, SS assignment, expression info, raw images, numerical values, cluster genes, functional chemistry, metabolic map locator, connectors to other maps, metabolic profiles, cofactors & metabolites, experimental data (Adapted from Gibas & Jambeck, 2001)]

Gene expression
• Pre-genomics era: One gene = One gene product = One behaviour
  [Figure: gene g1 with products p1, p2]
• Post-genomics era: Many genes = Many gene products = Many behaviours
  [Figure: genes g1, g2, g3 with products p1, p1’, p1’’, p2, p3, p4]

Microarray experiment
Spotting the arrays
• RED = Present (P) = highly expressed, detected by the detector
• YELLOW = Marginal (M) = expressed, “not sure” for the detector
• GREEN = Absent (A) = maybe expressed, not detected by the detector

Classification Problem (Golub et al 1999)
• ALL = acute lymphoblastic leukemia (lymphoid precursors)
• AML = acute myeloid leukemia (myeloid precursor)

Characterisation Problem (Stuart et al 2001)
Temporal gene expression profiles during kidney development. Data are expressed as the mean at each time for clusters of genes as defined by k-means clustering (1-5). The distribution of individual profiles is also shown for the most heterogeneous group (2, all). 13, 15, 17, 19, embryonic days; N, newborn; W, 1 week old; A, adult.

Characterisation Problem (Stuart et al 2001)
Functional associations of gene clusters. Gene clusters varied remarkably in terms of major functional classifications of component genes.
Group 1, expressed earlier in nephrogenesis, was most notable for genes involved in DNA replication (D), RNA production (R), protein synthesis (P), and morphogenesis (M), consistent with an actively proliferating tissue. Group 2 (which peaked in midnephrogenesis) was most notable for genes of the extracellular matrix (E) as well as morphogenetic genes (M). Group 3 (with a peak in neonatal life) was dominated by retrotransposon transcripts (RT). Group 4 was most notable for transport (T) and energy metabolism (EN) related genes. Group 5 genes (significantly up-regulated in the adult vs. all previous times) were more heterogeneous and included genes specifying catabolic enzymes (C), defense and immune recognition (DE), homeostasis of the organism as a whole (H), detoxification (DT), oxidative stress (RD), and transport (T).

Gene expression matrix
Rows = gene expression profiles
Columns = different conditions/time points (replicates); each entry is a signal (intensity) followed by a detection call

Genes                         A1 TSu74a   A2 TSu74a   A3 V10      B1 V12-A    B2 V12-B    B3 V12-C    C1 P1-A
AFFX-b-ActinMur/M12481_3_st   26.1 A      29.7 A      7.7 A       13.2 A      11.4 A      43.7 A      15.1
AFFX-YEL002c/WBP1_at          1.3 A       6.2 A       2.5 A       4.7 A       2.7 A       1.3 A       7
AFFX-YEL018w/_at              6.1 A       0.6 A       1.8 A       3.1 A       2.1 A       1.4 A       0.6
AFFX-YEL024w/RIP1_at          11.9 A      7.2 A       2.7 A       10.4 A      2.4 A       8.2 A       7.6
AFFX-YEL021w/URA3_at          11.8 A      6.6 A       2.6 A       12.4 A      6.9 A       7.6 A       6
92539_at                      2475.9 P    2091.3 P    1391.6 P    1407.9 P    1947.2 P    1572.9 P    1999.6
92540_f_at                    96.9 P      77.4 P      138.7 P     144.8 P     122.6 P     126.6 P     128.8
92541_at                      863.2 P     1920.6 P    1248.1 P    1384.9 P    268 P       352.3 P     856.4
92542_at                      702.4 P     868.3 P     558.4 P     613.1 P     631.8 P     602.1 P     548.3
92543_at                      56.7 P      56.7 P      75.5 P      61.6 P      72.5 P      76.6 P      56.2

Clustering Gene Expression Data
• A clustering problem consists of elements & a characteristic vector for each element
• A measure of similarity is defined between pairs of such vectors
• Elements = genes
• Vector = expression level of each gene
• Goal: Partition the elements into subsets (clusters) which satisfy:
  – Homogeneity: elements in the same cluster are highly similar to each other
  – Separation: elements from different clusters have low similarity to each other

Hierarchical clustering
[Figure: hierarchical clustering; rows = genes, columns = different experimental conditions/time points]

k-means clustering
[Figure: cluster of genes related to ‘casein’ in mammary gland tissues; lactation stage marked]

Linking gene expression data with morphological information
stage(A, pregnancy) :- gene_id(A, g1), gene_id(A, g2), …, has_ducts(A, medium), Fat_pad(A, medium).
[Figure: 47% Molecular Function, 38% Biological Process, 15% Cellular Component]

Challenges of KDBD
[Figure: hypotheses learned from current data, set against the goal of “all possible data” (in the universe)]

Learning in “Dirty” Biological Databases
• Experimental errors
• Wrong interpretation by biologists
• Human error during annotation process
• Non-standardised techniques
• Biased data

Expressive Capacity
Hypothesis for Glutathione reductase (GR) Family
Class(‘GR’,A) :- protein(A,B,C,D), Sequence(B,GxG(x)2G(x)16-19[DE]), Structure(C,bbasandwich), Has_seq(strand1_helix1_strand2,B,C), Function(D,oxidoreductases).
If the protein has the sequence motif GxG(x)2G(x)16-19[DE] in β1-α1-β2 of the 3-layer β-β-α sandwich structure and carries out an oxidoreductase reaction, then it belongs to the GR family.

Single Vs Multiple Methods
• Advantage - complement each other
• Increase expressive power - discover useful & understandable knowledge
• Difficult to combine - lack of coherence

Open Question?
[Figure: hypotheses induced from a training set drawn from current data, which continues to expand within “all data”]

Conclusion
[Figure: opening lines of the nucleotide sequence shown earlier]

Acknowledgements
• Gilleain Torrance, Mallika Veeramalai, Olivier Sand, Ali Al-Shahib (Bioinformatics Research Centre, University of Glasgow)
• David Westhead, (EBI), Ioannis Michalopoulos, Leeds University
• Janet Thornton, UCL, Birkbeck, EBI
• Lorenz Wernisch, Birkbeck
• Juris Viksna, University of Latvia
• Inge Jonassen, Ingvar Eidhammer, U.Bergen
• Alvis Brazma, EBI

Bioinformatics Research Centre
• Provides an environment for collaborative interdisciplinary research in Bioinformatics.
• Hosts researchers from
  – Department of Computing Science
  – Institute of Biomedical and Life Sciences.
• Physically located in the Institute of Biomedical and Life Sciences (Spring 2003)
• Strong links with
  – Sir Henry Wellcome Functional Genomics Facility.
  – Statistical Bioinformatics
  – Mathematical Biology
• Outreach programme (visitors etc)

The Scottish Bioinformatics Forum (SBF)
• Network of Bioinformatics researchers and industries in Scotland
• A vehicle for developing Scotland as a Centre of Bioinformatics Excellence
• Nodes in Glasgow, Edinburgh, Dundee, Aberdeen, ...
• Promoting collaborative research
• Development of a Bioinformatics educational programme
• www.sbforum.org, sbforum-general@sbforum.org

Contacts
{actan,drg}@brc.dcs.gla.ac.uk
Bioinformatics Research Centre
Department of Computing Science
University of Glasgow
http://www.brc.dcs.gla.ac.uk