Bioinformatics & Computational Biology
Podcast for Frontiers in Biology - ISU 7/13/06
Drena Dobbs
Genetics, Development and Cell Biology
Bioinformatics & Computational Biology
Iowa State University
Thanks to Mark Gerstein (Yale) & Eric Green (NIH) for many borrowed & modified PPTs

What is Bioinformatics? (& What is Computational Biology?)
Wikipedia:
• Bioinformatics & computational biology involve the use of techniques from mathematics, informatics, statistics, and computer science (& engineering) to solve biological problems

What is Bioinformatics? (& What is Computational Biology?)
Gerstein:
• (Molecular) Bioinformatics is conceptualizing biology in terms of molecules & applying “informatics” techniques - derived from disciplines such as mathematics, computer science, and statistics - to organize and understand information associated with these molecules, on a large scale
Modified from Mark Gerstein

What is the Information? Biological Sequences, Structures, Processes

Central Dogma of Molecular Biology
• DNA sequence -> RNA -> Protein -> Phenotype
• Molecules: Sequence, Structure, Function
• Processes: Mechanism, Specificity, Regulation

Central Paradigm for Bioinformatics
• Genomic (DNA) Sequence -> mRNA & other RNA sequence -> Protein sequence -> RNA & Protein Structure -> RNA & Protein Function -> Phenotype
• Large Amounts of Information: Standardized, Statistical
Modified from Mark Gerstein (idea from D Brutlag, Stanford; graphics from S Strobel)

Explosion of "Omes" & "Omics!"
Genome, Transcriptome, Proteome
* Note: the set of specific RNAs or proteins expressed varies greatly in different cells and tissues -- and critically depends on the age, developmental stage, disease state, etc.
of the organism
• Genome - the complete collection of DNA (genes and "non-genes") of an organism
• Transcriptome - the complete collection of RNAs (mRNAs & others) expressed in an organism *
• Proteome - the complete collection of proteins expressed in an organism *

Molecular Biology Information: DNA & RNA Sequences
Functions:
• Genetic material
• Information transfer (mRNA)
• Protein synthesis (tRNA/mRNA)
• Catalytic & regulatory activities (some very new!)
DNA sequence:
atggcaattaaaattggtatcaatggttttggtcgtat
gcacaacaccgtgatgacattgaagttgtaggtattaa
atggcttatatgttgaaatatgattcaactcacggtcg
aaagatggtaacttagtggttaatggtaaaactatccg
Gcaaacttaaactggggtgcaatcggtgttgatatcgctttaactg
atgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagtt
Information:
• 4 letter alphabet (DNA nucleotides: AGCT)
• ~ 1,000 base pairs in a small gene
• ~ 3 x 10^9 bp in a genome (human)
RNA sequence has "U" instead of "T"
• Where are the genes?
• Which DNA sequences encode mRNA?
• Which DNA sequences are "junk"?
• Which RNA sequences encode protein?

Molecular Biology Information: Protein Sequences
Functions: Most cellular functions are performed or facilitated by proteins
• Biocatalysis
• Cofactor transport/storage
• Mechanical motion/support
• Immune protection
• Regulation of growth and differentiation
Protein sequences:
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS
d4dfra_ ISLIAALAVDRVIGMENAMPWN LPADLAWFKRNTL
d3dfr__ TAFLWAQDRDGLIGKDGHLPWHLPDDLHYFRAQTV
Information:
• 20 letter alphabet (amino acids) ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• ~ 300 aa in an average protein (in bacteria)
• ~ 3 x 10^6 known protein sequences
• What is this protein?
• Which amino acids are most important -- for folding, activity, interaction with other proteins?
• Which sequence variations are harmful (or beneficial)?
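
The two sequence slides above can be illustrated with a few lines of code. The following Python sketch is an illustration under stated assumptions, not part of the original slides: the function names `dna_to_rna` and `percent_identity` are hypothetical, the T-to-U rewrite treats the DNA fragment as a coding strand, and the identity calculation assumes the two protein fragments are already lined up position by position with no gaps.

```python
DNA_ALPHABET = set("ACGT")                      # 4-letter nucleotide alphabet
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # 20-letter amino acid alphabet


def dna_to_rna(dna: str) -> str:
    """Rewrite a DNA coding-strand sequence as RNA by replacing T with U."""
    seq = dna.upper()
    if not set(seq) <= DNA_ALPHABET:
        raise ValueError("sequence contains characters outside ACGT")
    return seq.replace("T", "U")


def percent_identity(a: str, b: str) -> float:
    """Percent of positions carrying the same residue in two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    if not (set(a) <= PROTEIN_ALPHABET and set(b) <= PROTEIN_ALPHABET):
        raise ValueError("sequence contains characters outside the 20 amino acids")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Fragments copied from the slides (first DNA line; first two protein entries).
dna_fragment = "atggcaattaaaattggtatcaatggttttggtcgtat"
d1dhfa = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT"
d8dfr = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS"

rna_fragment = dna_to_rna(dna_fragment)
identity = percent_identity(d1dhfa, d8dfr)

print(rna_fragment)
print(f"{identity:.1f}% identical")
```

For these two fragments, 29 of 36 positions match (about 80.6%). Real comparisons would first align the sequences, allowing gaps, before counting identities.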
Molecular Biology Information: Macromolecular Structures
DNA/RNA/Protein Structures
• How does a protein (or RNA) sequence fold into an active 3-dimensional structure?
• Can we predict structure from sequence?
• Can we predict function from structure (or perhaps from sequence alone)?

We don't yet understand the protein folding code - but we try to engineer proteins anyway!

Molecular Biology Information: Biological Processes
Functional Genomics
• How do patterns of gene expression determine phenotype?
• Which genes and proteins are required for differentiation during development?
• How do proteins interact in biological networks?
• Which genes and pathways have been most highly conserved during evolution?

On a Large Scale? Whole Genome Sequencing
Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.
-- G A Petsko, Nature 401: 115-116 (1999)

Next Step after the Sequence? Understanding Gene Function on a Genomic Scale
• Expression Analysis
• Structural Genomics
• Protein Interactions
• Pathway Analysis
• Systems Biology
Evolutionary Implications of:
• Introns & Exons
• Intergenic Regions as "Gene Graveyard"

Gene Expression Data: the Transcriptome
MicroArray Data
Yeast Expression Data:
• Levels for all 6,000 genes!
• Experiments to investigate how genes respond to changes in environment or how patterns of expression change in normal vs cancerous tissue
Modified from Mark Gerstein (courtesy of J Hager)
ISU's Biotechnology Facilities include state-of-the-art Microarray & Proteomics instrumentation

Other Whole-Genome Experiments
Systematic Knockouts: Make "knockout" (null) mutations in every gene - one at a time - and analyze the resulting phenotypes! For yeast: 6,000 KO mutants!
2-hybrid Experiments: For each (and every) protein, identify every other protein with which it interacts! For yeast: 6000 x 6000 / 2 ~ 18M possible pairwise interactions to test!!

Molecular Biology Information: Integrating Data
• Understanding the function of genomes requires integration of many diverse and complex types of information:
  Metabolic pathways
  Regulatory networks
  Whole organism physiology
  Evolution, phylogeny
  Environment, ecology
  Literature (MEDLINE)

Storing & Analyzing Large-scale Information: Exponential Growth of Data Matched by Development of Computer Technology
CPU vs Disk & Net
• Both the increase in computer speed and the ability to store large amounts of information on computers have been crucial
• Improved computing resources have been a driving force in Bioinformatics
ISU's supercomputer "CyBlue" is among the 100 most powerful in the world
Modified from Mark Gerstein (Internet picture adapted from D Brutlag, Stanford)

Bioinformatics is born! & more Bioinformaticists are needed!
(Internet picture adapted from D Brutlag, Stanford)
Modified from Mark Gerstein (courtesy of Finn Drablos)
Weber cartoon, from Mark Gerstein

“Informatics” techniques in Bioinformatics
• Databases: Building, Querying; Object-oriented DB
• String Comparison: Text search; Alignment; Significance statistics
• Finding Patterns: Machine Learning; Data Mining; Statistics; Linguistics
• Geometry: Robotics; Graphics (Surfaces, Volumes); Comparison & 3D Matching
• Simulation & Modeling: Newtonian Mechanics; Electrostatics; Numerical Algorithms; Simulation; Network modeling

Challenges in Organizing Information: Redundancy and Multiplicity
• Different sequences can have the same structure
• Organism has many similar genes
• Single gene may have multiple functions
• Genes and proteins function in genetic and regulatory pathways
• How do we organize all this information so that we can make sense of it?
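
The 2-hybrid estimate earlier in this section (6000 x 6000 / 2 ~ 18M) is a count of unordered protein pairs. As a rough check, the Python sketch below counts those pairs exactly. The function name `candidate_pairs` and its `include_self` option are illustrative assumptions, and 6,000 is the slides' round number for yeast genes.

```python
from math import comb


def candidate_pairs(n_proteins: int, include_self: bool = False) -> int:
    """Number of unordered protein pairs a complete 2-hybrid screen would test."""
    if n_proteins < 0:
        raise ValueError("n_proteins must be non-negative")
    pairs = comb(n_proteins, 2)  # n * (n - 1) / 2 pairs of distinct proteins
    if include_self:
        pairs += n_proteins      # also test each protein against itself
    return pairs


yeast_pairs = candidate_pairs(6000)
yeast_pairs_with_self = candidate_pairs(6000, include_self=True)

print(f"{yeast_pairs:,} pairs of distinct proteins")
print(f"{yeast_pairs_with_self:,} pairs including self-interactions")
```

Either way of counting gives a figure within a few thousand of the 18 million quoted on the slide, so the approximation n x n / 2 is adequate at this scale.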
Integrative Genomics: genes <> structures <> functions <> pathways <> expression levels <> regulatory systems <> ….

Molecular Parts = Conserved Domains

"Parts List" approach to bike maintenance:
• How many roles can these play?
• How flexible and adaptable are they mechanically?
• What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)?
• What are the common parts - types of parts (nuts & washers)?
• Where are the parts located?

World of structures is also finite, providing a valuable simplification
[Figure: (human) 1 2 3 … ~30,000 genes; ~2,000 folds; (T. pallidum) 1 2 3 … ~2,000 genes]
Global Surveys of a Finite Set of Parts from Many Perspectives
Same logic for pathways, functions, sequence families, blocks, motifs....
Modified from Mark Gerstein. Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....

So, this is Bioinformatics
What is it good for?

Application I: Designing Drugs
• Understanding how proteins bind other molecules
• Docking & structure modeling
• Designing inhibitors
Modified from Mark Gerstein (figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).

Application II: Finding homologs
Finding WHAT? Homologs - "same genes" in different organisms
• Human vs. Mouse vs. Yeast
Much easier to do experiments on yeast!

Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S.
cerevisiae Proteins

Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter

Application III: Genome/Transcriptome/Proteome Characterization & Comparison
Databases, statistics
• Occurrence of specific genes or features in a genome
  How many kinases in yeast?
• Compare Tissues
  Which proteins are expressed in cancer vs normal tissues?
• Diagnostic tools
• Drug target discovery

Building “Designer” Zinc Finger DNA-binding Proteins
J Sander, Fengli Fu, J Townsend, R Winfrey, D Wright, K Joung, D Dobbs, D Voytas

Identifying "Missing" Components of Signal Transduction Pathways
Phil Becraft, GDCB
  Antony Chettoor
Drena Dobbs, GDCB
  Jae-Hyung Lee
Kai-Ming Ho, Physics
  Zhong Gao, Yungok Ihm, Haibo Cao, Cai-zhuang Wang

Designing New HIV Therapies
Susan Carpenter, VMPM
  Sijun Liu, Wendy Wood
Drena Dobbs, GDCB
  Jae-Hyung Lee
Kai-Ming Ho, Physics & Astronomy
  Yungok Ihm, Haibo Cao, Cai-zhuang Wang
Amy Andreotti, BBMB
Bruce Fulton, NMR Facility
Vasant Honavar, Com S
  Changhui Yan

Predicting Protein-Protein Interactions from Amino Acid Sequence
Vasant Honavar, Com S
  Changhui Yan
Drena Dobbs, GDCB
  Jae-Hyung Lee
Kai-Ming Ho, Physics
Robert Jernigan, BBMB