3rd Essential Practical Bioinformatics Workshop
Introductory Lecture: Why Bioinformatics?

Tan Tin Wee, Department of Biochemistry, YLL School of Medicine, NUS
and
Victor Tong Joo Chuan, I2R; Department of Biochemistry, YLL School of Medicine (Adjunct), NUS
Mohammad Asif Khan, Perdana University Graduate School of Medicine, Malaysia

Some definitions of Bioinformatics
- Bioinformatics is “the development and application of techniques from computer science, mathematics and statistics to address biological problems”. Dubitzky (Brief Bioinform. 2009; 10: 343)
- Bioinformatics is “the study of the information content and information flow in biological systems and processes”. Michael Liebman in “Bioinformatics: An Editorial Perspective” (http://www.netsci.org/Science/Bioinform/feature01.html)

Information flow in Biology

How does the plane fly from A to B?
- A 747-400 has six million parts.
- A 747-400 has 274 km of wiring and 8 km of tubing.
- Seventy-five thousand engineering drawings were used to produce the first 747.
- Would we understand how a pilot flies or lands the plane, i.e. the behaviour of the plane, if we took it all apart?
- Can I understand how this PowerPoint presentation works if I take my computer apart and analyse every chip and every transistor?

What drives us to study the Life Sciences?
Unraveling the Mysteries of Life!
SCALE: Time, Space
Image: http://instruct1.cit.cornell.edu/courses/bioes278_lecture/Topic1Basic_Concepts/07-Levels_of_organisation.jpg (Accessed Aug 16, 2006)

Why Bioinformatics?
Bioinformatics is at the beginning of taking Biology to the next level:
- Studying living things with information technology and with computing systems, and
- Thinking about living systems with information theory.

Unraveling the Mysteries of Life! With Information Theory?
- Soul?
- Quantum Information?
- Mind and Consciousness
- Brain
- Network of Neurons
- Epigenetic Code
- DNA Genetic Information
DEPTH
http://www.labgrab.com/users/triptangent/blog/laser-induced-gamma-waves-offer-insight-brain-functions

Living Things and Life Sciences are special!
Information, Energy, Matter

Information flow in Biology
Central Dogma and the “omics”
DNA → RNA → Protein
Genomics, Transcriptomics, Proteomics
Regulation, Interactomics, Metabolism, Metabolomics, Degradation/degradomics, Immunity/immunomics, etc.

- Structure of DNA: JD Watson and FHC Crick, April 25, 1953. Nature 171, 737-738 (1953)
- Fred Sanger (1975): Dideoxy chain termination DNA sequencing. J. Mol. Biol. 94 (3): 441–8
- Kary Mullis (1983): Polymerase Chain Reaction (PCR)
- Human Genome Project: formally began in October 1990, funded by DoE and NIH, USA
- Completion of the first assembly of the human genome: June 26, 2000 in Washington; Bill Clinton with Craig Venter and Francis Collins
- 13 years, $300 million

- 1995: Haemophilus (Bacteria), 1.6 Mb, ~1600 genes [Science 269: 496]
- 1997: Eukaryote (budding yeast), 13 Mb, ~6K genes [Nature 387: 1]
- 1998: C. elegans (Worm, Animal), ~100 Mb, ~20K genes [Science 282: 1945]
- 2000: Human, ~3 Gb, ~100K genes
Genomes highlight the finiteness of the “data” in Biology

“Next Gen” and 3rd Gen Sequencing
- 454 Life Sciences: Pyrosequencing (2005)
- Illumina (Solexa) (2006): latest HiSeq gives up to 200 Gb per run, 2 x 100 bp read length, up to 25 Gb per day. A single run can sequence two human genomes at ~30x coverage for less than $10,000 (USD) per genome
- ABI (Life Technologies) (2007): SOLiD (Sequencing by Oligonucleotide Ligation and Detection), 60 gigabases of usable DNA data per run

3rd Gen DNA Sequencing (New!)
Single Molecule PCR-independent Sequencing
- HeliScope Single Molecule Sequencer (2009)
- Pacific Biosciences SMRT™ (2009): 100 Gb/hr, >1000 bases/read
- 2010: Haitian cholera epidemic genome completed in 5 days
Jonas Korlach & Steve Turner

3rd Gen Sequencing: Impact
- RNA-seq for transcriptomics
- BS-seq (bisulfite seq) for DNA methylation
- ChIP-seq for DNA-protein interactions
- CNV-seq for copy number variations

Newer 3rd Gen DNA Sequencing (New!)
- Nanopore sequencing (Oxford Nanopore and UCSC)
- towards electronic, single-molecule sequencing of DNA strands
- http://www.nanoporetech.com/press_releases/detail/114
- Ion Torrent
- Quantum Dot sequencing (future?)

Bioinformatics is crucial for Personal Genomics
Another Wave Of Exponential Growth? The $1,000 Human Genome?
- 23andMe.com
- deCODEme
- Navigenics
- Knome
- Hellogenome (Theragen)
- Complete Genomics (USA)
- Beijing Genomics Institute (BGI)
- etc.

Bioinformatics in genomics
“the development and application of techniques from computer science, mathematics and statistics to address biological problems”
Solving the data deluge in:
- Personal genomics
- Cancer genomics
- 1000 genomes project
How to store the data? How to analyse the data? How to extract information and knowledge from the data?

From genome (1D) to structure (3D)
Crystal to Structure Pipeline
- Crystal Management
- Crystal Mounting
- Crystal Alignment
- Crystal Description
- Data Collection
- Structure Determination
ASAP
- Automation of individual process steps
- Systems integration of automated procedures
- Knowledge-based approach to structure determination
- Crystal description based on diffraction pattern & micro-beam exposures
- Structure determination based on ASAP; processing engines developed by collaborators
© John Wooley, UCSD 2003

The Industrial Scale Discovery Pipeline of JCSG
HT Pipeline: Processes, Bottlenecks and Leaks
target selection, expression, cloning, imaging, harvesting, bl xtal mounting, xtal screening, publication, PDB, annotation, purification, crystallization, data collection, struc. validation, phasing, tracing, struc. refinement
© John Wooley, UCSD 2003

Growth of Protein Databank
PDB Structures
Common cold: structure of the protein shell, or capsid, of the human rhinovirus. Credit: J. Y. Sgro, UW-Madison

Molecules, complexes, pathways
Regulating biological processes
Jak-STAT Signaling Pathway. KEGG Pathways. http://www.genome.jp/kegg/pathway/hsa/hsa04630.html (Accessed: Aug 5, 2011)

Multiple pathways in a cell…
Cells have multiple processes that must be coordinated
Metabolism Pathways. KEGG Pathways. http://www.genome.jp/kegg/pathway/map/map01100.html (Accessed: Aug 5, 2011)

Parts of a machine
Image: http://www.edwardsheattreating.com/images/machine%20parts.jpg
Image: http://www.sperdvac.org/Horizontal%20Mill/milling%20machine.jpg
And so, we study the individual parts of the machine in order to understand the machine itself.

Flagellar system
http://www.fbs.osaka-u.ac.jp/en/seminar/image/09_img13.jpg
http://www.fbs.osaka-u.ac.jp/en/seminar/image/09_img12.jpg
© Protonic NanoMachine Project, ERATO, Japan (Namba)

Interactomics
Which proteins (biomolecules) interact with which proteins (biomolecules)?
Pathway information
Stanyon et al. Genome Biology 2004 5:R96

Cellular “Circuitry”
Representing and simulating the well-known genetic switch mechanism that λ (lambda) phage uses to choose between the lysis and lysogeny growth pathways, e.g. with the hybrid functional Petri net technique
http://www.genomicobject.net/member3/GONET/lambda.html
E-cell
- Model builder
- Algorithm modules
- Simulation visualiser
© E-cell.org

Tissue-Organ Modeling: Simulating heart myocardial function
The Physiome Project © Peter Hunter
http://ep.physoc.org/cgi/content/full/89/1/1

BioImaging Technologies With Genomics, Proteomics
Computational Biology
© Nature Cell Biology 2003

Integrative Biology
© Nature Cell Biology 2003

Novartis Institute for Tropical Diseases (NITD) STOPDengue Project
Integrative Biology helping Industry
© Nature Cell Biology 2003

Bioinformatics and Computational Biology Underpin Integrative Biology to handle the complexity of biological data
Figure labels: 1D, 2D, 3D, 6D

Spectrum of Bioinformatics and Computational Biology
- COMPUTATIONAL / INFORMATIONAL (in silico research): Tissue/Organ Physiology, “E-Cell” simulation, Regulatory Networks/Circuits, Pathways, Interactions, Function, Structure, Sequence
- EXPERIMENTAL / OBSERVATIONAL (in vitro, in vivo research): Genomics, Proteomics, Transcriptomics, Metabolomics, other ‘omics, BioImaging

Biology is Big Science these days
After the genome projects, industrial-scale generation of data is no big deal. Sophisticated bioinstrumentation, from automated sequencers to microarray systems, runs 24/7 and churns out ever larger volumes of data at ever higher throughput.
- Stanford BioX (US$150M)
- MIT CSBi (US$10M/yr)
- Princeton Sigler Inst for Integrative Genomics, ICAHN Lab (US$40M)
- Duke Institute for Genome Sciences and Craig Venter’s TCAG (US$250M)
- UMichigan LSI (US$380M)
- QB3 UCaliforniaSF/Scruz/Berkeley (US$200M)
- Cornell LSI (US$140)
- UCSD JCSG
© GenomeWeb LLC 2003
- Singapore’s Mechanobiology Institute (S$150M)
Bioinformatics and Computational Biology underpin the research process.
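To put the data volumes in numbers, the sequencing figures quoted earlier in the lecture (a ~3 Gb human genome, ~30x coverage, up to 200 Gb per HiSeq run at up to 25 Gb per day) can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function and constant names are illustrative, and the inputs are the quoted upper-bound figures, not measured yields.

```python
# Back-of-the-envelope check of the sequencing figures quoted in the lecture.
# All inputs are the slide's "up to" numbers; real runs will yield less.

HUMAN_GENOME_GB = 3.0    # human genome size, ~3 Gb
COVERAGE = 30            # target depth, ~30x
RUN_OUTPUT_GB = 200.0    # HiSeq output per run (upper bound)
DAILY_OUTPUT_GB = 25.0   # HiSeq output per day (upper bound)


def bases_for_coverage_gb(genome_gb, coverage):
    """Raw sequence (in Gb) needed to cover a genome at the given depth."""
    return genome_gb * coverage


def genomes_per_run(run_output_gb, genome_gb, coverage):
    """How many genomes one run can cover at the given depth."""
    return run_output_gb / bases_for_coverage_gb(genome_gb, coverage)


def run_length_days(run_output_gb, daily_output_gb):
    """Days needed to produce one run's worth of output."""
    return run_output_gb / daily_output_gb


if __name__ == "__main__":
    needed = bases_for_coverage_gb(HUMAN_GENOME_GB, COVERAGE)
    per_run = genomes_per_run(RUN_OUTPUT_GB, HUMAN_GENOME_GB, COVERAGE)
    days = run_length_days(RUN_OUTPUT_GB, DAILY_OUTPUT_GB)
    print(f"Sequence needed per genome: {needed:.0f} Gb")
    print(f"Genomes per run at {COVERAGE}x: {per_run:.1f}")
    print(f"Run length: {days:.0f} days")
```

One genome at 30x needs about 90 Gb of raw sequence, so a 200 Gb run covers roughly two genomes, in line with the slide's claim, and takes about 8 days at the quoted daily rate.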
The Biological Data Deluge of Volume and Complexity
Critical need for Bioinformatics and Computational Biology expertise in the next generation of Biologists
- 20th C: Century of Physics and the Atom Bomb
- 21st C: Century of Biology and Biotechnology
The Economist Feb/Mar 2010

What does all of this have to do with you?
Basic IT and Computer Literacy in Biologists?

Basics of Bioinformatics in Brief
Bioinformatics applications are made up of many combinations ...
Visualisation, Data, Database, Algorithm

Biological Databases
- Collect, organize and classify data
- Query the dataset
- Retrieve entries based on keyword search
- Limitations of databases
Image from Entrez Gene. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene (Accessed Aug 5, 2011)

Sequence Analysis and other bioinformatics software
- Why bioinformatics software?
- Examples of selected software
- Scope and sources of bioinformatics software
- Cautions re: bioinformatics software
- Accessing and using bioinformatics software

Sequence Comparison, Alignment, Assembly
- After collecting a set of related sequences, how can we compare them as a set?
- How should we line up the sequences so that the most similar portions are together?
- What do we do with sequences of different lengths?
- How can we compare a given sequence to the millions in the database?
- Which ones are truly related by evolution?
- What can the study of related sequences tell us?

Patterns and Motifs
- What is the signature found in groups of sequences?
- How can we use these signature patterns or motifs for rapid identification of familial relationships?
- Can patterns be used to assign function?
Image: http://www-lecb.ncifcrf.gov/~toms/sequencelogo.html

Evolution and Phylogenetic Analysis
- What is a phylogenetic tree?
- Algorithms used to generate a phylogenetic tree
- Decisions governing choice of phylogenetic programs
- Interpreting genome structure using phylogeny
Image: David Begun http://www.newsandevents.utoronto.ca/bios/askus4.htm

Structure Visualization
- Using graphic tools to view structures
- Simple commands to analyse structures and active sites
- Different graphic representations and colouring schemes
- The function of a protein is a consequence of its folded state: Anfinsen, 1961
- The 3D fold of a protein is called its structure
- In 3D, the business end of the protein has contributions from different regions of its sequence
Image: Eric Martz RasMol Gallery. http://www.umass.edu/microbio/rasmol/galmz.htm

Course objectives of the workshop
Students will be sufficiently familiar with a set of common bioinformatics resources, and their underlying concepts, such that:
- You will be able to apply these resources appropriately to solve biological questions.
- You will be able to independently identify, use and assess additional resources as they become available.
- You will be prepared to pursue more advanced bioinformatics training.
- You will begin to think computationally and informatically in doing biology.

Bioinformatics and Computational Thinking in Solving Problems in Biology
Database, Algorithms, Analysis, Visualization, Biology, Software Applications

READINGS:
- Jonathan Pevsner (2009) Bioinformatics and Functional Genomics (2nd edition). Wiley-Blackwell. Chapters 1 and 2; relevant database sections in Chapters 19 and 20. See http://bioinfbook.org/
- Arthur Lesk (2008) Introduction to Bioinformatics (3rd edition). Oxford University Press. Chapters 3 and 4. (Optional)
- For Practicals, if you are still lost: Jean-Michel Claverie and Cedric Notredame (2007) Bioinformatics for Dummies (2nd edition). Wiley Publishing. (Optional)
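
As a small taste of the computational thinking the workshop aims for, the questions raised under Sequence Comparison ("How should we line up the sequences?", "What do we do with sequences of different lengths?") can be sketched with a textbook global alignment by dynamic programming (Needleman-Wunsch). This is a minimal illustration, not one of the workshop's tools: the function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for readability, and real software uses substitution matrices and more refined gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment of two sequences.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    The scoring values are illustrative defaults, not a standard.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, top, bottom = global_align("ACGT", "ACT")
    print(s)
    print(top)
    print(bottom)
```

For the two toy sequences ACGT and ACT, the sketch places a single gap opposite the G (ACGT over AC-T, score 1), which shows how sequences of different lengths are handled: the shorter one is padded with gaps wherever that costs least.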