BioInformatics at FSU: what it is, who’s doing it, and why it needs to be done now.

Steve Thompson
Florida State University School of Computational Science and Information Technology (CSIT)

Introductory outline:
What is bioinformatics, genomics, sequence analysis, computational molecular biology . . .
Reverse Biochemistry & Evolution.
Database growth & CPU power.
A very brief ‘Show and Tell,’ NCBI Resources, GCG’s SeqLab, phylogenetics.
High quality training is essential! Graduates need to be competitive on a

My definitions:
Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, and organisms, all the way to populations.
Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases.
Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.
Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between

The reverse biochemistry analogy:
From a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round.
Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.
The computer and molecular databases

The exponential growth of molecular sequence databases & CPU power.
Year    BasePairs       Sequences
1982    680338          606
1983    2274029         2427
1984    3368765         4175
1985    5204420         5700
1986    9615371         9978
1987    15514776        14584
1988    23800000        20579
1989    34762585        28791
1990    49179285        39533
1991    71947426        55627
1992    101008486       78608
1993    157152442       143492
1994    217102462       215273
1995    384939485       555694
1996    651972984       1021211
1997    1160300687      1765847
1998    2008761784      2837897
1999    3841163011      4864570
2000    11101066288     10106023
2001    15849921438     14976310
2002    28507990166     22318883

Doubling time ~ 1 year!
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Database growth (cont.)
The Human Genome Project and numerous other genome projects have kept the data coming at alarming rates. As of April 2003 (50 years after the Watson-Crick double-helix!), 16 Archaea, 128 Bacteria, and 10 Eukaryote complete, finished genomes, and 4 Vertebrate and 5 Plant essentially complete genome maps, are publicly available for analysis, not counting all the virus and viroid genomes available.

The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were published mid-February 2001 in the journals

Some neat stuff from those papers:
We, Homo sapiens, aren’t nearly as special as we had once hoped we were. Of the 3.2 billion base pairs in our DNA:
Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!
The protein coding region of our genome is only about 1% or so; much of the remainder ‘junk’ is ‘jumping,’ ‘selfish DNA,’ of which much may be involved in regulation and control. Understanding this network is a huge challenge.
100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome!
(Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)

What are primary (Central Dogma: DNA -> RNA -> protein) sequences?
Primary refers to one dimension: all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.
The symbols are the one-letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence; however, much of this type of information is available in the reference documentation sections associated with primary

What are sequence databases?
These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another.
North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept. Also Georgetown University’s NBRF Protein Identification Resource: PIR & NRL_3D.
Europe: European Molecular Biology Laboratory (also EBI & ExPasy): EMBL & Swiss-Prot.
Asia: The DNA Data Bank of Japan (DDBJ).

Content & organization:
Most sequence database installations are examples of complex ASCII/Binary databases, but they usually are not Oracle or SQL or Object Oriented (proprietary ones often are). They often contain several very long text files, each holding a different type of information related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing index functions.
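The index idea in the last sentence can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the layout of any real database installation: the flat file is a toy FASTA-style text, the identifiers and sequences are made up, the function names `build_offset_index` and `fetch_entry` are hypothetical, and an in-memory dictionary stands in for the binary index file.

```python
import io


def build_offset_index(handle):
    """Map each entry identifier to the byte offset of its title line."""
    index = {}
    while True:
        offset = handle.tell()          # position before reading the line
        line = handle.readline()
        if not line:                    # end of file
            break
        if line.startswith(b">"):
            tokens = line[1:].split()
            if tokens:                  # identifier = first token after '>'
                index[tokens[0].decode("ascii")] = offset
    return index


def fetch_entry(handle, index, entry_id):
    """Jump straight to one entry and return (title, sequence)."""
    handle.seek(index[entry_id])
    title = handle.readline()[1:].strip().decode("ascii")
    chunks = []
    for line in handle:
        if line.startswith(b">"):       # next entry begins, stop reading
            break
        chunks.append(line.strip().decode("ascii"))
    return title, "".join(chunks)


# Toy two-entry flat file; identifiers and sequences are invented.
FLAT_FILE = io.BytesIO(
    b">seq1 hypothetical entry one\nATGGCC\nTTAA\n"
    b">seq2 hypothetical entry two\nGGGCCC\n"
)
INDEX = build_offset_index(FLAT_FILE)
TITLE, SEQUENCE = fetch_entry(FLAT_FILE, INDEX, "seq2")
print(TITLE, SEQUENCE)
```

The point of the design is that the long text file is scanned only once; afterwards any entry can be retrieved with a single seek instead of a read through the whole file, which is the service the binary index files provide.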
Software is usually required to successfully interact with these databases, and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise.
Nucleic acid databases are split into subdivisions based on taxonomy (historical). Protein databases are often

What are other biological databases?
Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database.
Still more; these can be considered ‘non-molecular’:
Reference Databases: e.g.
OMIM: Online Mendelian Inheritance in Man.
PubMed/MedLine: over 11 million citations from more than 4 thousand bio/medical scientific journals.
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).
Population studies data: which strains, where, etc.
And then databases that most biocomputing folk don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census

So how do you do bioinformatics?
Often on the InterNet over the World Wide Web:

Site                          URL (Uniform Resource Locator)           Content
Nat’l Center Biotech' Info'   http://www.ncbi.nlm.nih.gov/             databases/analysis/software
PIR/NBRF                      http://www-nbrf.georgetown.edu/          protein sequence database
IUBIO Biology Archive         http://iubio.bio.indiana.edu/            database/software archive
Univ. of Montreal             http://megasun.bch.umontreal.ca/         database/software archive
Japan's GenomeNet             http://www.genome.ad.jp/                 databases/analysis/software
European Mol' Bio' Lab'       http://www.embl-heidelberg.de/           databases/analysis/software
European Bioinformatics       http://www.ebi.ac.uk/                    databases/analysis/software
The Sanger Institute          http://www.sanger.ac.uk/                 databases/analysis/software
Univ. of Geneva BioWeb        http://www.expasy.ch/                    databases/analysis/software
ProteinDataBank               http://www.rcsb.org/pdb/                 3D mol' structure database
Molecules R Us                http://molbio.info.nih.gov/cgi-bin/pdb/  3D protein/nuc' visualization
The Genome DataBase           http://www.gdb.org/                      The Human Genome

What other resources are available?
Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy.
So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!
Therefore, UNIX server-based solutions, public domain or commercial (e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.]): the SeqLab Graphical User Interface.

University bioinformatics objectives:
The university tripartite mission: Education, Research, and Service.
Education: out-reach programs and undergraduate and graduate courses.
Research: bioinformatics is becoming an indispensable tool in most biological research, particularly in molecular and cellular biology.
Service: those faculty and staff that know bioinformatics should be available to assist with consultation,

Education:
Workshops: continue to teach GCG SeqLab tutorial series; each of the four sessions offered once per semester.
Modules: across the university curricula within existing courses, interdisciplinary by nature, implications, & ethics.
Graduate and Undergraduate Courses: presently three cross-listed biology courses; one introductory, team-taught survey, stressing practical, project-oriented approaches; one advanced algorithms lecture; one programming practicum.
Computational Molecular Biology Program: proposed; to be in association and cooperating with students’ present major department, coordinated by CSIT. Pros and cons . . .
GCG SeqLab workshop series:
Four different sessions: Intro’ to SeqLab & Multiple Sequence Analysis and its supplement, Rational Primer Design, Database Searching & Pairwise Comparisons (Significance), Molecular Evolutionary Phylogenetics.
FOR MORE INFO... http://bio.fsu.edu/~stevet/workshop.html

Modules in existing courses:
Cooperate with extant programs to incorporate bioinformatics into their existing curricula. Key is to demonstrate necessity of knowledge & offer full cooperation with departments.
Potential courses exist across many different departments, and even across different colleges.
Identify potential courses from the General Catalog and approach individual

Courses at Florida State:
Four different Special Topics Biology Department BSC4933/5936 Bioinformatics sections.
First (first offered Spring 2002): “Introduction to Bioinformatics” Steve Thompson et al. Covers both sequence and structural analysis. Team-taught; lecture + optional lab; pragmatic, real-world, project-oriented approach. Survey level: introduction to the theory + practical applications. Pluses and minuses;

Courses (cont.)
Second (first offered Fall 2002): “Programming Skills for Computational Biology and Bioinformatics” David Swofford. The Java model, an object oriented framework.
Third (first offered Spring 2003): “Advanced Bioinformatics: Computational Methods” David Swofford. The theory behind sequence analysis algorithms.
New (Fall 2003): “Genomics and

Courses (cont.)
Departments other than Biology:
Mathematics: MAP 5485 “Introduction to Mathematical Biophysics” Jack Quine. Mathematical tools in Biophysics. An integral part of their Biomedical Mathematics Program.
Institute of Molecular Biophysics: Center Of Excellence In Biomolecular Computer Modeling & Simulation.
In all courses: don’t ignore implications, ramifications, & ethics of bioinformatics research.
Undergraduate opportunities:
A special undergraduate fellowship: The Howard Hughes Undergraduate Program in Mathematical and Computational Biology.
Twelve Hughes Fellows per year earn a $5000 stipend, a $1200 summer housing allowance, and a $1000 professional meeting allowance.
Supported by two new undergraduate

Computational Biology Program:
Presently FSU computational biology is composed of a confusing mix of undergraduate and graduate programs across at least three different Departments from the College of Arts and Sciences.
We propose a CSIT-coordinated ‘balloon’ program, in association with each student’s major department, that would consolidate these efforts.
Pros and Cons . . . Undergraduate and/or Graduate? Avoid duplication of effort. Candidate department collaborations.

Summer short course:
Long-range ‘pipe dream’?
Broad spectrum: both instructors and students from many different disciplines and world-wide distribution. See MBL Mol’ Evol’ Workshop for a model.
One or two weeks? On-campus room & board support.
To be undertaken this course will

Bioinformatics degree programs around the world:
Relatively rare, but more are being created all the time. Biocomputing education URL’s are documented at: http://www.csit.fsu.edu/HHP/gradprog.html
Most are graduate course lists, many are graduate Masters or Ph.D. programs, some include undergraduate courses or programs. There is a huge need for

Conclusions:
Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”
He continues: “. . . if any lesson is to be drawn . . .
it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to

FOR MORE INFO...
Visit my Web page: http://bio.fsu.edu/~stevet/cv.html.
Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or long distance collaboration.
Many fine texts are also starting to become available in the field. To ‘honk-my-own-horn’ a bit, check out the new Current Protocols in Bioinformatics from John Wiley & Sons, Inc: http://www.does.org/cp/bioinfo.html. They asked me to contribute a chapter on multiple sequence
Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their Introduction to Bioinformatics: A Theoretical And Practical Approach, http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0.
Both volumes are now available.