#2 - Biological Databases 8/22/07
BCB 444/544 F07 ISU Dobbs

BCB 444/544
Finish: Lecture 1 - What is Bioinformatics?
Lecture 2: Biological Databases & ISU Resources
#2_Aug22

BCB 444/544 - Website
http://bindr.gdcb.iastate.edu/bcb544
• Updated Syllabus (hyperlink)
• Lecture & Lab Schedules (with Homework Assignments)
• Lecture PPTs & PDFs
• Lab Exercises
• Practice Exams
• Grading Policy
• Project Guidelines, etc.
• Links
• Check regularly for updates!

BCB 444/544 - Computer Lab
Meets in 1304 MBB every week EXCEPT this week:
1st Lab meets in Library Rm 32
Current schedule: Thurs 1-3 PM
Conflicts? See Drena

Required Reading (must read before lecture)
Wed Aug 22 - for Lecture #2
• Xiong Textbook:
  • Chp 1 - Introduction
  • Chp 2 - Biological Databases
Thurs Aug 23 - for Lab #1:
• Literature Resources for Bioinformatics
  Andrea Dinkelman, see Lab Schedule for URL
Fri Aug 24
• Genomics & Its Impact on Science & Society: Genomics & Human Genome Project Primer
  see Lecture Schedule for URL

Assignment #1: Tell us about you
Due: Today - Wed, Aug 22
1- Complete HW1_Aug20 for Drena

Assignment #2 (& for Fun): DNA Interactive "Genomes"
http://www.dnai.org/c/index.html
A tutorial on genomic sequencing, gene structure, gene prediction
Howard Hughes Medical Institute (HHMI)
Cold Spring Harbor Laboratory (CSHL)
1. Take the Tour
2. Read about the Project
3. Do some Genome Mining with:
Nothing to turn in - just do it!

#1 - What is Bioinformatics? (cont.)

1st Draft Human Genome: "Finished" in 2001

Xiong: Chp 1
1 Introduction
What Is Bioinformatics?
Goal
Scope
Applications
Limitations
New Themes
Further Reading

(1st Draft Human Genome slide: modified from Eric Green)

Human Genome Sequencing
Two approaches:
• Public (government) - International Consortium (mainly 6 countries, NIH-funded in US)
  • Hierarchical cloning & BAC-to-BAC sequencing
  • Map-based assembly
• Private (industry) - Celera, Craig Venter, CEO
  • Whole genome random "shotgun" sequencing
  • Computational assembly (took advantage of public maps & sequences, too)
Guess which human genome they sequenced? Craig's
How many genes? ~20,000 (Science, May 2007)

Public Sequencing: International Consortium
(Modified from Eric Green)

"Complete" Human Genome Sequence: What next?
(Modified from Eric Green)

Comparison of Sequenced Genome Sizes
Plants? Many have much larger genomes than human!
(from Eric Green)

How can we begin to understand the complete Human Genome Sequence?
(from Eric Green)

Next Step after the Complete Sequence?
Understanding Gene Function on a Genomic Scale
• Expression Analysis
• Structural Genomics
• Protein Interactions
• Network Analysis
• Systems Biology
Evolutionary Implications of:
• Intergenic Regions as "Gene Graveyard"
• Introns & Exons
(Modified from Mark Gerstein)

Comparative Genomics: Compare entire genomes
(from Eric Green)

Comparing Genomes: Identifying functional elements
(from Eric Green)

Gene Expression Data: the Transcriptome
MicroArray Data
Yeast Expression Data:
• Levels for all 6,000 genes!
• Investigate how all genes respond to changes in environment or, in humans, e.g., how patterns of RNA expression change in normal vs cancerous tissue
ISU's Biotechnology Facilities include state-of-the-art Microarray Instrumentation
(Modified from Mark Gerstein)

Other "Omes"
Proteome, Metabolome, Glycome, etc.
ISU has state-of-the-art Proteomics Instrumentation
ISU has state-of-the-art Metabolomics Instrumentation

How are "Omes" related?
Systems Biology seeks to integrate all of these to explain the complex behaviors of whole systems (cells, organisms, ecosystems)

Molecular Biology Information: Integrating Data
Understanding the function of genomes requires integration of many diverse and complex types of information:
• Metabolic pathways
• Regulatory networks
• Whole organism physiology
• Evolution, phylogeny
• Environment, ecology
• Literature (MEDLINE)
(Modified from Mark Gerstein)

Other Genome-Scale Experiments
Systematic Knockouts:
• Make "knockout" (null) mutations in every gene - one at a time - and analyze the resulting phenotypes!
• For yeast: 6,000 KO mutants!
2-hybrid Experiments:
• For each (and every) protein, identify every other protein with which it interacts!
• For yeast: 6000 x 6000 / 2 ~ 18M interactions!!
(Modified from Mark Gerstein)

Storing & Analyzing Genomic Information:
Exponential Growth of Data Coupled with Development of Fast Computer Technology
• Increases in computer speed & storage capacity have been dramatic
• Improved computing resources & more efficient algorithms have been driving forces in Bioinformatics & Computational Biology
ISU's supercomputer "CyBlue" is among the 100 most powerful computers in the world!
(Modified from Mark Gerstein)

Bioinformatics is born!
& more Bioinformaticists are needed!
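A quick arithmetic check of the 2-hybrid estimate above (6000 x 6000 / 2 ~ 18M interactions for yeast) can be sketched in Python. This is only an illustration and not part of the original slides: the function name pairwise_tests and its include_self option are hypothetical, and 6,000 is the approximate yeast gene count quoted in the lecture. The exact number of unordered pairs is n(n-1)/2, or n(n+1)/2 if each protein is also tested against itself; n x n / 2 approximates both closely when n is large.

```python
from math import comb


def pairwise_tests(n_proteins: int, include_self: bool = False) -> int:
    """Count unordered protein pairs in an all-vs-all interaction screen.

    Hypothetical helper for illustration only.
    """
    pairs = comb(n_proteins, 2)      # distinct pairs A-B, order ignored
    if include_self:
        pairs += n_proteins          # also test each protein against itself (A-A)
    return pairs


if __name__ == "__main__":
    n = 6000  # approximate number of yeast genes, as quoted in the lecture
    print(pairwise_tests(n))                     # 17997000
    print(pairwise_tests(n, include_self=True))  # 18003000
    print(n * n // 2)                            # 18000000 (the slide's approximation)
```

Either way the count is about 18 million, so the rounding on the slide is reasonable. The number of tests grows roughly with the square of the number of proteins, which is why such screens depend on automation and computation.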
(Internet picture in "Exponential Growth of Data": adapted from D Brutlag, Stanford)

"Informatics" techniques used in Bioinformatics
• Databases
  • Building & querying object-oriented & relational DBs
• String Comparison
  • Text search
  • Alignment
  • Significance statistics
• Pattern Finding
  • Machine Learning
  • Data Mining
  • Statistics
  • Linguistics
• Computational Geometry
  • Robotics
  • Graphics (surfaces, volumes)
  • Comparison & 3D matching
• Simulation & Modeling
  • Newtonian mechanics
  • Electrostatics
  • Numerical algorithms
  • Simulation
  • Network modeling
  • Population modeling
(Modified from Mark Gerstein)

Challenges in Organizing Information: Redundancy and Multiplicity
• Different protein sequences can assume the same 3-D structure
• Organisms have many similar genes with redundant functions
• A single gene may have several different functions
• Genes & proteins function in complex genetic & regulatory pathways
• How do we organize all this information so that we can make sense of it?
Functional Genomics & Systems Biology:
sequences <> motifs <> genes <> RNAs <> proteins <> structures <> functions <> expression levels <> pathways <> regulatory networks <> functional systems
(Modified from Mark Gerstein)

One Strategy: Molecular Parts = Conserved Domains
(Modified from Mark Gerstein)

"Parts List" approach to bike maintenance:
• Where are the parts located?
• Which are the common parts (bolt, nut, washer, spring, bearing)?
• Which are unique parts (cogs, levers)?
• How flexible and adaptable are parts mechanically?
(Modified from Mark Gerstein)

World of macromolecular structures is also finite, providing a valuable simplification
• H. sapiens: ~20,000 genes
• T. pallidum: ~2,000 genes
• ~2,000 folds
Global surveys of a finite set of parts from different perspectives
Same logic for pathways, functions, sequence families, blocks, motifs....
BUT, what actually happens inside cells or within whole organisms is very complex, providing a challenging complication!
(Modified from Mark Gerstein)

So, having a list of parts is not enough!
BIG QUESTION? How do parts work together to form a functional system?
SYSTEMS BIOLOGY
What is a system?
Macromolecular complex, pathway, network, cell, tissue, organism, ecosystem…

• Exploring the Virtual Cell at ISU
• Virtual Cell projects elsewhere...
• NCBI's Bookshelf - a great resource!

So, this is Bioinformatics
What is it good for?
Just a few examples…

Designing drugs
• Understanding how proteins bind other molecules
• Structural modeling & ligand docking
• Designing inhibitors or modulators of key proteins
(Figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center)
(Modified from Mark Gerstein)

Finding WHAT?
Finding homologs of "new" human genes
Homologs - "same genes" in different organisms
• Human vs Mouse vs Yeast
• Much easier to do experiments on yeast to determine function
• Often, function of an ortholog in at least one organism is known
(Modified from Mark Gerstein)

Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins

Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter
(Modified from Mark Gerstein)

Comparative Genomics: Genome/Transcriptome/Proteome/Metabolome
Databases, statistics
• Occurrence of specific genes or features in a genome
  • How many kinases in yeast?
• Compare Tissues
  • Which proteins are expressed in cancer vs normal tissues?
  • Diagnostic tools
  • Drug target discovery
(Modified from Mark Gerstein)

Molecular Recognition: Analyzing & Predicting Macromolecular Interfaces (in DNA, RNA & protein complexes)
Drena Dobbs, GDCB
  Jae-Hyung Lee, Michael Terribilini, Jeff Sander, Pete Zaback
Vasant Honavar, Com S
  Feihong Wu, Cornelia Caragea, Fadi Towfic, Jivo Sinapov
Robert Jernigan, BBMB
  Taner Sen, Andrzej Kloczkowski
Kai-Ming Ho, Physics

Structure & Function of Human Telomerase:
Predicting structure & functional sites in a clinically important but "recalcitrant" RNP
Cell Biologist:
Biochemist:
Imagined structure:
Image sources: www.intl-pag.org/, www.chemicon.com, Lingner et al (1997) Science 276: 561-567.
How would a systems biologist study telomerase?

Designing Zinc Finger DNA-binding Proteins to Recognize Specific Sites in Genomic DNA
Drena Dobbs, GDCB
  Jeff Sander, Pete Zaback
Dan Voytas, GDCB
  Fengli Fu
Les Miller, ComS
Vasant Honavar, ComS
Keith Joung, Harvard

SUMMARY: #1 - What is Bioinformatics?

#2 - Biological Databases
Xiong: Chp 2
2 Introduction to Biological Databases
What Is a Database?
Types of Databases
Biological Databases
Pitfalls of Biological Databases
Information Retrieval from Biological Databases
Summary
Further Reading

What is a Database?
Duh!!
OK: we'll skip that!

Types of Databases
3 major types of electronic databases:
1- Flat files - simple text files
• no organization to facilitate retrieval
2- Relational - data organized as tables ("relations")
• shared features among tables allow rapid search
3- Object-oriented - data organized as "objects"
• objects associated hierarchically

Biological Databases
Currently - all 3 types, but MANY flat files
What are goals of biological databases?
1- Information retrieval
2- Knowledge discovery
Important issue: Interconnectivity

Types of Biological Databases
1- Primary
• "simple" archives of sequences, structures, images, etc.
• raw data, minimal annotations, not always well curated!
2- Secondary
• enhanced with more complete annotation of sequences, structures, images, etc.
• usually curated!
3- Specialized
• focused on a particular research interest or organism
• usually - not always - highly curated

Examples of Biological Databases
1- Primary
• DNA sequences
  • GenBank - US
  • European Molecular Biology Lab - EMBL
  • DNA Data Bank of Japan - DDBJ
• Structures (Protein, DNA, RNA)
  • PDB - Protein Data Bank
  • NDB - Nucleic Acid Database
2- Secondary
• Protein sequences
  • Swiss-Prot, TrEMBL, PIR
  • these recently combined into UniProt
3- Specialized
• Species-specific (or "taxonomic" specific)
  • FlyBase, WormBase, AceDB, PlantDB
• Molecule-specific, disease-specific

Pitfalls of Biological Databases
• Errors!
• Lack of documentation re: quality or reliability of data
• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
• Redundancy
• Inconsistency
• Incompatibility (format, terminology, data types, etc.)

Information Retrieval from Biological Databases
2 most popular retrieval systems:
• ENTREZ - NCBI
  • will use a LOT - Introduced in Lab 1
• SRS - Sequence Retrieval System - EBI
  • will use less, similar to ENTREZ
Both:
• Provide access to multiple databases
• Allow complex queries

Web Resources: Bioinformatics & Computational Biology
• Wikipedia: Bioinformatics
• NCBI - National Center for Biotechnology Information
• ISCB - International Society for Computational Biology
• JCB - Jena Center for Bioinformatics
• UBC - Bioinformatics Links Directory
• UWa - BioMolecules
• Pitt - OBRC Online Bioinformatics Resources Collection
• ISU - Bioinformatics Resources - Andrea Dinkelman
• ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)

ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
• BCB Lab - (Student-Led Consulting & Resources)
• BCB - Bioinformatics & Computational Biology
• LH Baker Center - Bioinformatics & Biological Statistics
• CIAG - Center for Integrated Animal Genomics
• CILD - Computational Intelligence, Learning & Discovery
• NSF IGERT Training Grant - Computational Molecular Biology
ISU Facilities:
• Biotechnology - Instrumentation Facilities
• PSI - Plant Sciences Institute
• PSI Centers

SUMMARY: #2 - Biological Databases
BEWARE!
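To make the difference between the flat-file and relational database types listed earlier in this lecture more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how GenBank or any real biological database is implemented: the tab-separated record layout, the table names (sequences, organisms), and the helper functions are all hypothetical. The accession numbers and gene names are taken from the human/yeast homolog table shown earlier. The relational version uses Python's standard sqlite3 module with an in-memory database.

```python
import sqlite3

# Flat file: plain tab-separated text, one record per line (hypothetical layout).
FLAT_FILE = """\
U03911\tMSH2\tH. sapiens
M84170\tMSH2\tS. cerevisiae
U07418\tMLH1\tH. sapiens
U07187\tMLH1\tS. cerevisiae
"""


def flat_file_lookup(text, gene):
    """Scan every line; the file has no organization to speed up retrieval."""
    hits = []
    for line in text.splitlines():
        acc, name, organism = line.split("\t")
        if name == gene:
            hits.append((acc, organism))
    return hits


def build_relational_db(text):
    """Load the same records into two related tables (in-memory SQLite)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE organisms (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    con.execute(
        "CREATE TABLE sequences ("
        "acc TEXT PRIMARY KEY, gene TEXT, "
        "organism_id INTEGER REFERENCES organisms(id))"
    )
    for line in text.splitlines():
        acc, gene, organism = line.split("\t")
        # Store each organism name once; sequences refer to it by id.
        con.execute("INSERT OR IGNORE INTO organisms (name) VALUES (?)", (organism,))
        (org_id,) = con.execute(
            "SELECT id FROM organisms WHERE name = ?", (organism,)
        ).fetchone()
        con.execute("INSERT INTO sequences VALUES (?, ?, ?)", (acc, gene, org_id))
    # An index on the searched column lets the engine avoid a full scan.
    con.execute("CREATE INDEX idx_gene ON sequences (gene)")
    return con


def relational_lookup(con, gene):
    """Join the two tables on their shared feature (organism_id)."""
    rows = con.execute(
        "SELECT s.acc, o.name FROM sequences AS s "
        "JOIN organisms AS o ON o.id = s.organism_id "
        "WHERE s.gene = ? ORDER BY s.acc",
        (gene,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    con = build_relational_db(FLAT_FILE)
    print(flat_file_lookup(FLAT_FILE, "MSH2"))
    print(relational_lookup(con, "MSH2"))
```

Both lookups return the same two MSH2 records (in different order), but the flat-file version must read every line, while the relational version can use the index and the shared organism_id key. That shared key is the "shared features among tables" idea from the Types of Databases slide.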