From genotype to phenotype: Bioinformatic tools for functional genomics

Steve Oliver
Professor of Genomics
School of Biological Sciences
University of Manchester
http://www.bioinf.man.ac.uk

Infoberg
Sequence data
Functional data

Functional Genomics

Level of analysis: Genome
Definition: Complete set of genes of an organism or its organelles.
Status: Context-independent (modifications to the yeast genome may be made with exquisite precision).
Method of analysis: Systematic DNA sequencing.

Level of analysis: Transcriptome
Definition: Complete set of mRNA molecules present in a cell, tissue or organ.
Status: Context-dependent (the complement of mRNAs varies with changes in physiology, development or pathology).
Method of analysis: Hybridisation arrays. SAGE. High-throughput Northern analysis.

Level of analysis: Proteome
Definition: Complete set of protein molecules present in a cell, tissue or organ.
Status: Context-dependent.
Method of analysis: 2-D gel electrophoresis. Peptide mass fingerprinting. Two-hybrid analysis.

Level of analysis: Metabolome
Definition: Complete set of metabolites (low molecular weight intermediates) present in a cell, tissue or organ.
Status: Context-dependent.
Method of analysis: Infra-red spectroscopy. Mass spectrometry. Nuclear magnetic resonance spectrometry.

GENOME TRANSCRIPTOME PROTEOME METABOLOME

[Figure: Aberdeen PRF1: S. cerevisiae 2D map. Axis ticks run from 4.0 to 6.5 and from 10 to 150; protein spots are labelled by gene name (e.g. ADE6, CDC48, HSP60, HIS4, HXK2, FBA1).]
Peptide mass fingerprinting

1. Denature the protein:
KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCLPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHFDASV
2. Digest (trypsin), giving the peptides labelled m1 to m12 on the slide:
KETAAAK, FER, QHMDSSTSAASSSNYCNQMMK, SR, NLTK, DR, CLPVNTFVHESLADVQAVCSQK, NVACK, NGQTNCYQSYSTMSITDCR, ETGSSK, YPNCAYKTTQANK, HIIVACEGNPYVPVHFDASV
3. Mass spectrometry: a spectrum of abundance against mass, with peaks labelled by peptide.

PROBLEMS WITH ‘CLASSICAL’ PROTEOME ANALYSIS:
1. Not comprehensive
2. Not high-throughput
3. Destroys protein-protein interactions that provide important clues to function

[Figure: Number of (protein) database matches (0 to 450) against peptide mass (1000 to 2000 Da) for C. elegans, S. cerevisiae, E. coli and H. influenzae.]

Just Enough Diagnostic Information

UMIST and Univ. Manchester: Kwushant Sidhu, Polkit Sangvanich, Tony Sullivan, Olaf Wolkenhauer, Simon Gaskell, Simon Hubbard, Francesco Brancia, Steve Oliver

Provide limited sequence information by:
1. Identification of the N-terminal amino acid by PTC derivatisation
2. Use of guanidination to identify the C-terminus, determine lysine content, and improve signal response
3. Specific fragmentation next to Asp residues using MALDI-QToF MS

Search procedure:
1. Start with the initial set of search peptides and associated information.
2. Search the database and compile a protein “hit list” with matching peptides.
3. The top-scoring protein is matched; remove its corresponding peptides from the search list.
4. If all initial search peptide masses are matched, stop; otherwise continue searching.

[Figure: % unambiguous identification against total number of search peptides. Panels: S. cerevisiae 1 protein; S. cerevisiae 2 proteins; C. elegans 1 protein; C. elegans 2 proteins. Series: standard, guanidination, PTC (500), PTC (50), Asp-frag, Asp-frag (All).]

[Figure: Identification in a mixture of 3 S. cerevisiae proteins. % unambiguous identification against number of search peptides. Series: Asp-promoted daughter ions, N-terminal residue, Peptide masses only (5 ppm).]

Genome Information Management System (GIMS)
• Norman Paton, Carole Goble, Mike Cornell, Paul Kirby - Dept. of Computer Sciences, University of Manchester.
• Steve Oliver, Andy Hayes, Andy Brass - School of Biological Sciences, University of Manchester.

Object Data Model for GIMS

[Figure: class diagram. Classes: Genome, Chromosome, Chromosome Fragment {Abstract}, Transcribed, Non Transcribed, Gene, Transcribed Fragments {Abstract}, mRNA, snRNA, tRNA, rRNA, Spliced Transcript, Spliced Transcript Component, Intron, ORF, Primary Polypeptide, Functional Protein, Chromosomal Element, Regulators, Promoter, Terminator, CEN, ORI, TEL. Associations: contains, next/prior, is a, composed of, composes, translates to.]

Data in GIMS

Data type. Data source.
• DNA sequences, chromosome locations of coding regions, e.g. ORFs, tRNAs, centromeres, telomeres etc. Source: MIPS
• Predicted protein sequences, pI, mol weight, number of transmembrane regions. Source: MIPS
• Protein attributes (e.g. cellular location, function, protein class, Prosite motifs, phenotype). Source: MIPS
• Protein interaction data (yeast two-hybrid). Source: MIPS, Uetz et al. (2000), Ito et al. (2001)
• Protein interaction data (genetic interactions). Source: MIPS

Data in GIMS (2)

Data type. Data source.
• Protein interaction data (TAP & HMS-PCI complexes). Source: Gavin et al. (2002), Ho et al. (2002)
• Metabolic data (reactions, compounds and enzymes). Source: L-compound, L-enzyme
• Post-translational modifications. Source: YPD
• Transcription factor. Source: SCPD
• Transcriptome data. Source: Stanford Microarray Database, Rosetta Inpharmatics, Inc and University of Manchester

Evaluating protein-interaction data

Integrating complex data with yeast two-hybrid data
A complex consists of six proteins: A, B, C, D, E, F.
In a yeast two-hybrid experiment, A interacts with another protein. Is it B, C, D, E or F?

Percentages of protein pairs sharing the same cellular location

Dataset                         % protein interactions compatible with subcellular location
MIPS complexes                  99.0
HMS-PCI complexes               47.6
TAP complexes                   55.3
Y2H interactions                48.3
Randomly generated complexes    33.7

[Figure: Large-scale interaction data and the distribution of interactions according to functional categories.]

[Figure: Quantitative comparison of interaction datasets.]
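
A percentage of the kind tabulated for protein pairs sharing a cellular location could be computed roughly as follows. This is a minimal sketch, not the procedure behind the slide's figures: it assumes that every pair of proteins within a complex counts as one interaction, that a pair is "compatible" when the two proteins share at least one annotated location, and that unannotated pairs are skipped. The function name, protein names and annotations are hypothetical.

```python
from itertools import combinations


def colocalised_fraction(complexes, location):
    """Percentage of within-complex protein pairs sharing a location.

    complexes: iterable of sets of protein names.
    location:  dict mapping protein name -> set of annotated locations.
    Pairs in which either protein lacks an annotation are skipped.
    """
    compatible = 0
    total = 0
    for members in complexes:
        for a, b in combinations(sorted(members), 2):
            if a not in location or b not in location:
                continue  # no annotation for one partner: cannot judge
            total += 1
            if location[a] & location[b]:  # at least one shared location
                compatible += 1
    return 100.0 * compatible / total if total else float("nan")


# Hypothetical toy data, not taken from MIPS or any real dataset.
toy_complexes = [{"P1", "P2", "P3"}, {"P4", "P5"}]
toy_location = {
    "P1": {"nucleus"},
    "P2": {"nucleus", "cytoplasm"},
    "P3": {"cytoplasm"},
    "P4": {"mitochondrion"},
    "P5": {"mitochondrion"},
}

print(colocalised_fraction(toy_complexes, toy_location))  # 3 of 4 pairs -> 75.0
```

Running the same function over randomly generated complexes of matching sizes would give the kind of baseline against which the experimental datasets are compared.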
MyGrid
Personalised extensible environments for data-intensive in silico experiments in biology
Professor Carole Goble, University of Manchester
Dr Alan Robinson, EBI

Approach
[Diagram: Applications; Toolkits; Metadata; Personalisation; Interoperation layer; Context mgt; Process mgt; Communication fabric; Data mgt.]

Robot Scientist Project
Aberystwyth and York/Imperial: Ross King, Phil Reissner, Douglas Kell, Stephen Muggleton, Chris Bryant
Manchester: Steve Oliver

The Robot Scientist Project
Aim: build a physical implementation of a scientific active learning system and apply it to functional genomics
Test problem: Genetic control of aromatic amino acid biosynthesis in yeast
1. Devise experiments to select between hypotheses
2. Direct a robot to perform experiments
3. Automatically analyse the experimental results
4. Revise set of hypotheses

[Diagram: Background Knowledge; Analysis; Learning Engine; Consistent Hypotheses; New Biological Knowledge; Experiment Selection; Experimental Results; Robot.]

Analysis of Results
Cost of the chemicals consumed
The cost of the chemicals consumed in converging upon a hypothesis with an accuracy in the range 46–88% was lower when trials were selected by ASE-Progol than when they were sampled at random.
To reach an accuracy in the range 46–80%, ASE-Progol incurs costs five orders of magnitude lower than random sampling.

Analysis of Results (cont’d)
Duration of experiment
ASE-Progol requires less time to converge upon a hypothesis with an accuracy in the range 74–87% than if trials are sampled at random or selected using the naïve strategy.
To reach an accuracy of 80% takes:
ASE-Progol = 4 days
Random sampling = 6 days
Naïve strategy = 10 days

CONCLUSIONS
1. Take full advantage of complete genome sequences
2. Promote close cooperation between experimentalists & bioinformaticians/computer scientists
3. Require integration of data from different ‘omic levels to mine reliable biological information
4. Exploit machine-learning techniques in the design, execution, and interpretation of experiments in functional genomics
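
The four-step Robot Scientist cycle (devise experiments, run them, analyse results, revise hypotheses) can be illustrated with a toy active-learning loop. This is only a sketch under stated assumptions and is not ASE-Progol: hypotheses are reduced to tables of predicted growth outcomes, the selection rule simply prefers the experiment that splits the remaining hypotheses most evenly and breaks ties by lower cost, and the "robot" is simulated. All names, predictions and costs are hypothetical.

```python
def select_experiment(hypotheses, experiments, cost):
    """Choose the experiment that best discriminates between hypotheses.

    Prefers the most even split of predicted outcomes; ties go to the
    cheaper experiment. Returns None if no experiment discriminates.
    """
    def score(exp):
        positives = sum(1 for h in hypotheses.values() if h[exp])
        imbalance = abs(2 * positives - len(hypotheses))
        return (imbalance, cost[exp])

    informative = [e for e in experiments
                   if len({h[e] for h in hypotheses.values()}) > 1]
    return min(informative, key=score) if informative else None


def active_learning(hypotheses, experiments, cost, run_experiment):
    """Repeat select -> run -> analyse -> revise until one hypothesis is left."""
    remaining = dict(hypotheses)
    performed, spent = [], 0.0
    while len(remaining) > 1:
        exp = select_experiment(remaining, experiments, cost)
        if exp is None:
            break  # remaining hypotheses cannot be told apart
        outcome = run_experiment(exp)          # direct the (simulated) robot
        performed.append(exp)
        spent += cost[exp]
        remaining = {name: h for name, h in remaining.items()
                     if h[exp] == outcome}     # drop inconsistent hypotheses
    return remaining, performed, spent


# Hypothetical predictions: does the strain grow in each experiment?
predictions = {
    "H1": {"E1": True,  "E2": True,  "E3": False},
    "H2": {"E1": True,  "E2": False, "E3": False},
    "H3": {"E1": False, "E2": True,  "E3": False},
    "H4": {"E1": False, "E2": False, "E3": True},
}
costs = {"E1": 1.0, "E2": 2.0, "E3": 5.0}

# Simulated robot: outcomes follow H3, which plays the role of the truth.
remaining, performed, spent = active_learning(
    predictions, ["E1", "E2", "E3"], costs,
    run_experiment=lambda exp: predictions["H3"][exp],
)
print(list(remaining), performed, spent)  # ['H3'] ['E1', 'E2'] 3.0
```

Replacing `select_experiment` with a random choice gives a baseline of the kind the cost and duration comparisons refer to.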