SAN DIEGO MESA COLLEGE
Introduction Molecular Cell Biology Laboratory (Bio210A)
Instructor: Elmar Schmid, Ph.D.

Bioinformatics & 16S-rRNA-based phylogenetic analysis

Laboratory Objectives

After completion of this lab you should:
1. have a basic understanding of the working principle of modern DNA sequencing methods, most importantly of the Sanger method
2. have a deep understanding of biological databases and of the basic working principles of bioinformatics using common search tools and algorithms
3. be able to submit a DNA sequence, e.g. retrieved after DNA sequencing or from a public database, to the NIH/NCBI-hosted BLAST search engines
4. be able to do a basic interpretation of BLAST search results in the context of bacterial identification based on submitted 16S-rRNA gene sequences
5. understand the importance of bacterial rRNA genes, especially the evolutionarily highly conserved gene for 16S-rRNA, for phylogenetic analysis in modern microbiology

Necessary Materials & Equipment

- Bacterial 16S-rRNA gene sequence, e.g. retrieved after 16S-rRNA PCR and subsequent DNA sequencing - will be supplied by the instructor (see separate lab hand-out)
- Computer workstation with internet access
- Printer
- Paper & writing materials

Introduction

In the past 20 years, the DNA sequences of thousands of genes, and even of the whole DNA content (often referred to as the genome) of many life forms, have been read with the help of a revolutionary molecular biological technique called DNA sequencing.
With the deciphering of the complete human genomic DNA sequence, with its more than 3 billion base pairs, at the beginning of this millennium, genetics and molecular biology hold the great promise of understanding many fundamental processes in biology, such as embryonic development, growth, aging, cancer and the many heritable diseases, at the molecular level.

DNA sequencing, which is the deciphering of the order of nucleotides within a given DNA molecule, became possible with the introduction of the Sanger DNA sequencing method into modern lab routines.

The Sanger method is the most widely applied and by now automated DNA sequencing method (see Graphic 1 below):
- this elegant method uses 2′,3′-dideoxynucleoside triphosphates (ddNTPs), which lack a 3′-hydroxyl group
- 4 different ddNTPs (ddATP, ddTTP, ddGTP & ddCTP), which are usually labeled with different fluorescent dyes (= fluorophores), are mixed with "non-labeled" dNTPs and added, together with DNA of unknown sequence as template, to a DNA polymerase enzyme
- in this sequencing method, single-stranded DNA with unknown nucleotide sequence serves as the template strand for in vitro DNA synthesis with the help of the enzyme DNA polymerase; whenever the DNA polymerase incorporates a ddNTP instead of a dNTP while copying the DNA template, it stops, and no further nucleotides are added to the copied strand because the ddNTP lacks the 3′-OH group
- since the incorporation of a ddNTP instead of a dNTP into the growing daughter DNA strand is a random event, daughter DNA strands of different lengths are generated
- the fluorescently labeled DNA daughter strands of different lengths are then separated by gel electrophoresis using long gel slabs or gel capillaries
- the method requires a synthetic 5′-end-fluorophore-labeled oligodeoxynucleotide as primer to start DNA synthesis

Graphic 1: DNA Sequencing: The Sanger Method. Panels: ssDNA (with unknown sequence); reactions with ddATP, ddTTP, ddCTP and ddGTP; electrophoresis; reading of the sequencing gel; sequence deduction; DNA sequence (of ssDNA). (Graphic © E. Schmid, 2006)

Knowledge of the complete DNA sequence, not only of the human genome but also of other life forms including bacteria, has the great potential to pave the way for sophisticated, improved DNA-based diagnostic tools and technologies that test for mutations and directly compare the DNA sequences of different life forms in order to retrace evolutionary change and unravel genetic relatedness.

The human genetic map indeed holds great promise for biology and medicine: in the near future it may help to locate, identify and therapeutically target the genes responsible for currently incurable human genetic disorders, such as neurofibromatosis (NF), cystic fibrosis (CF) and X-linked severe combined immunodeficiency (X-SCID), and for other human malfunctions.

Knowledge of the gene sequences and even whole genomes of bacteria and microbial pathogens can be used to find better cures for microbial diseases, to develop new, DNA-based vaccines, to accelerate bacterial detection and to retrace genetic and evolutionary relationships among different life forms; the latter purpose is referred to as phylogenetic analysis.

However, in order to use, sort and handle the vast amount of gene and genome DNA sequence data, biologists began to incorporate sophisticated computer tools and mathematical algorithms into their work to analyze, interpret and predict the structure and function of the many identified DNA sequences.

Not surprisingly, the completion of the sequencing of many bacterial genomes, e.g. E. coli, and of the human genome coincided with the advent of a new subdiscipline of modern biology, commonly referred to as bioinformatics.

Bioinformatics is the study of genetic and other biological information using computer technology together with statistical techniques and algorithms; it means the scientific use of computer hardware and software to retrieve, compare and analyze biological data, most importantly DNA nucleotide sequences, protein sequences and three-dimensional protein structures.

The primary goal of computational molecular biology is to understand the meaning of the genomic information and how this information is expressed in the form of gene patterns, proteins and enzymes.

With the knowledge of more and more completely sequenced genomes, transcriptomes and proteomes of many biological organisms (see Table I), more and more scientists incorporate bioinformatics into their work to answer crucial questions.

Today bioinformatics is routinely used to:
1. Compare and analyze the nucleotide and amino acid sequences of different organisms for:
   a. conserved sequences
   b. homologous regions (sequence homology)
2. Predict the biological function of genes and proteins from their primary DNA sequence
   - for example: isolation and deduction of the biological function of the NF1 gene from cancer patients suffering from neurofibromatosis (NF)
   - for an overview see Graphic 2 below
Graphic 2: High sequence homology in the C-terminus of the human NF1 protein and the Saccharomyces cerevisiae Ira protein. Procedure panels: NF1 cancer patient; isolate NF1 cells; make cDNA library; cDNA clone & sequence; NF1 protein isolation & DNA sequencing; translate deduced amino acid sequence; submit query to BLAST (blastx) or BLAST (blastn); homologies after sequence homology search; match: Ira protein (yeast), function: rasGAP protein. (Graphic © E. Schmid, 2006)

3. Predict the three-dimensional structure of identified proteins and RNA from their linear sequences by comparison with the sequences of other known proteins and/or RNA whose three-dimensional structures have been successfully resolved
4. Understand how and when genes are expressed (= gene expression analysis)

The major task of modern bioinformatics is the computer-assisted search for:
1. similar sequences (= homology);
2. functional domains and
3. structural similarities
in the exponentially growing DNA or protein data banks.

Before we can look up and familiarize ourselves with GenBank and with BLAST, currently the most widely used bioinformatics tool, we have to understand the essential terminology used in bioinformatics:

1. Homology
   - refers to gene or protein sequences with similar sequences, structures and functions
   - it is the key concept that relates sequence similarity to inferences about structure and function
2. Genome
   - the entire chromosomal genetic material of an organism
3. Genomics
   - the comprehensive study of whole sets of genes and their interactions rather than single genes
   - the most widely used "tools" to perform these studies are the so-called DNA microarrays, often referred to as "gene chips"
4. Proteome
   - the full complement of proteins within a cell or organism, produced by a particular genome
5. Locus (plural: loci)
   - chromosomal location of a gene or other piece of DNA
6. Pseudogene
   - a sequence of DNA similar to a gene but non-functional
   - probably the remnant of a once-functional gene that accumulated too many mutations in the course of evolutionary history
7. Repetitive DNA
   - DNA sequences of varying lengths, such as Alu, LINE or SINE sequences, that occur in multiple copies in the genome
   - it represents a major part of the genome of some biological organisms, e.g. Homo sapiens (humans)
   - these sequences are usually not considered (= filtered out) by the most widely used sequence homology search algorithms, e.g. BLAST
8. Conserved DNA sequences
   - nucleotide sequence(s) which are found in highly similar versions in a variety of different genes or in the genomes of other life forms
   - sequences which did not undergo significant changes and have been "evolutionarily conserved", typically because they serve essential functions
9. Annotations
   - additional information, such as origin of the sequence (animal, plant, etc.), key features of the sequence (start/stop codons, etc.) and references to journal articles, which is linked to each sequence entry in certain databases
10. BLAST
   - a search tool, accessible via the NIH/NCBI website, that allows identification of homologous sequences of genes and proteins of different organisms

Today, more and more biological data are stored, retrieved and analyzed with the help of computer systems. Especially molecular biological data, such as:
1. completed or drafted nucleotide sequences of genes
2. the nucleotide sequences of whole genomes of biological organisms
3. the sequenced or deduced amino acid sequences of protein fragments or of complete proteins
4. the structural (= 3D) coordinates of molecules, macromolecules, protein fragments or complete proteins
are submitted and banked with the help of computer systems in so-called databanks.

Worldwide, there are currently several dozen servers that provide access to over 300 different databanks.

The currently largest and most comprehensive databanks are run, evaluated, exchanged and updated daily by publicly funded organizations:
1. NIH/NCBI's GenBank (U.S.A.)
2. EBI/EMBL Nucleotide Data Bank (E.U.)
3. DNA Data Bank of Japan (DDBJ)

The data stored in these three databanks, as well as in the databanks run by smaller, publicly funded research organizations, are open to the public and can be accessed via the internet free of charge.

In the past years, a number of privately owned companies, e.g. Celera and Incyte, started highly ambitious genome sequencing programs with the goal of establishing their own, proprietary data banks:
- companies like Celera also developed their own sophisticated annotation and database search programs
- these proprietary data banks can only be accessed by paying subscribers

In general, all these databanks provide the researcher with "sextant, compass and charts" in the form of sophisticated computer algorithms and "search engines" that enable targeted navigation through the genome maps.

Before getting started with this bioinformatics lab, let's look at the most important biological databanks in more detail, with a major focus on genomic databases.
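To make the idea of a sequence similarity search in a databank more concrete, the following minimal Python sketch ranks the records of a tiny in-memory "databank" by their ungapped percent identity with a query. It is an illustration under stated assumptions only: the record names and sequences are invented for this example (they are not real GenBank entries), the function names are hypothetical, and the position-by-position comparison is a deliberately naive stand-in, not the algorithm that BLAST actually uses.

```python
def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity over the shared length of two sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / n


def best_window_identity(query: str, subject: str) -> float:
    """Slide the query along the subject and return the best identity found."""
    q, s = query.upper(), subject.upper()
    if not q or len(q) > len(s):
        # Nothing to slide: fall back to a single direct comparison.
        return percent_identity(q, s)
    return max(
        percent_identity(q, s[i:i + len(q)])
        for i in range(len(s) - len(q) + 1)
    )


def rank_records(query: str, databank: dict[str, str]) -> list[tuple[str, float]]:
    """Return (record name, identity) pairs, best match first."""
    scored = [(name, best_window_identity(query, seq))
              for name, seq in databank.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented toy records for illustration (NOT real databank entries).
TOY_DATABANK = {
    "toy_record_1": "TTGACGATGCCAGGACGCTGATTT",
    "toy_record_2": "TTGACGATGCCTGGACGTTGATTT",
    "toy_record_3": "CCCCCCCCCCCCCCCCCCCCCCCC",
}

if __name__ == "__main__":
    for name, score in rank_records("ATGCCAGGACGCTGAT", TOY_DATABANK):
        print(f"{name}: {score:.1f}% identity")
```

The ranked list plays the same role as the "hit list" in a real search report: the record at the top is the one most similar to the query, and records with low identity are unlikely to be homologous.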
GenBank (☺) (NIH/NCBI, USA)

GenBank® is the genetic sequence database hosted and updated daily by the National Institutes of Health (NIH).
- It comprises an annotated collection of all publicly available DNA sequences, which can be accessed free of charge via the web site of the NIH-supported NCBI (= National Center for Biotechnology Information) at the following web address: http://www.ncbi.nlm.nih.gov/
- As of April 2001, there are approximately 12,419,000,000 bases in 11,546,000 sequence records in this database.
- GenBank is an essential part of the International Nucleotide Sequence Database Collaboration, which further comprises the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL).
- All three publicly funded databases exchange their new database entries on a daily basis.

(☺) You will be accessing this database in this course!

EMBL Nucleotide Sequence Database (European Bioinformatics Institute = EBI, Europe)

This data bank, run by the EMBL out-station EBI, constitutes Europe's primary resource for DNA and RNA nucleotide sequences.
- The EMBL Nucleotide Sequence Database contains 14,366,182 entries comprising 15,383,451,165 nucleotides.
- The nucleotide database and other bioinformatics resources can be accessed free of charge via the internet at: http://www.ebi.ac.uk

DNA Data Bank of Japan (DDBJ)

TIGR Database (= databases of "The Institute for Genomic Research (TIGR)")

TIGR is a not-for-profit research institute located in Rockville, MD (U.S.A.), which was founded in 1992.
Its major research interest lies in the structural, functional and comparative analysis of genomes and gene products from a wide variety of organisms including viruses, pathogenic and non-pathogenic bacteria, archaea and eukaryotes.
- The TIGR databases contain finished and unfinished DNA sequences of a diversity of bacterial and plant genomes, such as Mycobacterium tuberculosis, Helicobacter pylori or Arabidopsis thaliana.
- In 1995, TIGR was the first research institution to completely sequence the whole genomes of two bacteria: Haemophilus influenzae and Mycoplasma genitalium.
- Scientists at TIGR were also the first to complete the genome sequences of the archaeon Methanococcus jannaschii (1996) and of the oral disease-causing microbe Porphyromonas gingivalis.
- The TIGR databases can be accessed free of charge via the internet at the following web address: http://www.tigr.org/tdb/

____________________________ (Student Name / Team) ____________ (Date)

Procedure

a. Have your bioinformatics worksheet hand-out showing the 16S rRNA gene sequence of an unknown bacterium ready and take a seat in front of the computer workstation
   - your instructor will hand this worksheet out at the beginning of this lab session
b. Use the internet-connected computer system and open the NIH/NCBI (National Institutes of Health/National Center for Biotechnology Information) home page by typing in the following web address: http://www.ncbi.nlm.nih.gov
c. In the upper section, mouse-click on the ‘BLAST’ icon to access the BLAST program
d. Scroll down to the ‘Basic BLAST’ section to choose a BLAST program; click on the ‘nucleotide blast’ hyperlink text to access the nucleotide-nucleotide sequence analysis section of BLAST
e.
In the ‘Enter Query Sequence’ window, type in the DNA sequence you received with your worksheet hand-out
   - work in teams of two and have one team member read the nucleotide sequence aloud to the other member sitting in front of the computer workstation
   - make sure that you do not make any typing errors while doing this important job
f. Go to the ‘Choose Search Set’ section and under ‘Database’, click on the ‘Others (nr, etc.)’ icon to select the GenBank database for your query
g. Click on the ‘BLAST’ search icon at the bottom of this page to start the sequence similarity search of your submitted bacterial 16S rRNA gene sequence within the GenBank database
h. The database search will take some time, but after a few seconds you should receive a result report including an overview graphic similar to the one shown in Graphic 2 below
i. Analyze these results carefully and answer the questions below.
j. After you are done with your analysis (and hopefully have an idea from which bacterium the DNA was isolated), turn in the completely filled-in pages 10 & 11 (the two question sheets that follow) as part of your weekly lab report (don't forget to put your name and the date on the 2 sheets)

_____________ (Date) ______________________ (Student Name)

What was the ID number of the DNA sequence you submitted to BLAST?
DNA Sequence ID #: _____

How many BLAST hits (if any) did you get with your 16S rRNA query sequence?
___________ hits

Scroll further down to the ’Sequences producing significant alignments’ section of the BLAST results data sheet and look up the "hit list", i.e. the list showing the microorganisms with the highest sequence homology; the bacterium named at the top of this list is the one with the best match to your submitted sequence (= query sequence).

Write down the following pieces of information about your best-matching bacterium:

I.
What is the name of the top-scoring bacterium?
_____________________________________

II. Which gene is matched, i.e. has the highest sequence homology with your submitted 16S rRNA gene (query) sequence?
_______________ gene

III. What are the best (= highest) maximum score and total score of the best-matching bacterium?
Max Score = _______ Total Score = _______

IV. What is the GenBank accession number of the highest-scoring bacterium?
Accession Number: _______________

Now click on the ‘Accession’ number hyperlink of your best-matching bacterium and retrieve further information about this microorganism.

V. Who deposited the DNA sequence of the bacterium with which your submitted DNA sequence has the highest sequence homology? And when?
Depositor(s): ___________________________________
Year: ___________________________________

VI. What else can you say about the bacterium with which your DNA sequence has the highest sequence homology? Try to retrieve further information about it, e.g. its origin, the source from which the bacterium was isolated, where, when, etc.
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________

VII. Now, what can you speculate about the nature of your unknown bacterium from which you isolated the DNA and did the 16S rRNA gene analysis?
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________

Graphic: nBLAST result of a nucleotide sequence homology search with the Thermus thermophilus (Tt) SSB gene sequence. Labels: query sequence (= Tt-SSB gene); best match; second-best match; low sequence homology matches.
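When filling in questions I to IV of the worksheet, it can help to think of the BLAST hit list as a table of entries sorted by score. The Python sketch below picks the best-matching entry from such a table. It is only an illustration: the `BlastHit` class and both functions are hypothetical helpers written for this hand-out, not part of any NCBI tool, and all descriptions, scores and accession strings are placeholders invented for the example rather than real BLAST output or GenBank records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastHit:
    """One row of a hit list (hypothetical structure for illustration)."""
    description: str
    max_score: float
    total_score: float
    accession: str


def best_hit(hits):
    """Return the hit with the highest max score (ties: higher total score).

    Returns None when the hit list is empty.
    """
    if not hits:
        return None
    return max(hits, key=lambda h: (h.max_score, h.total_score))


def worksheet_summary(hits):
    """Format the values asked for in worksheet questions I to IV."""
    top = best_hit(hits)
    if top is None:
        return "0 hits"
    return (
        f"{len(hits)} hits; top: {top.description}; "
        f"Max Score = {top.max_score:g}; Total Score = {top.total_score:g}; "
        f"Accession: {top.accession}"
    )


# Placeholder values only (NOT real BLAST results or accession numbers).
EXAMPLE_HITS = [
    BlastHit("Placeholder bacterium B 16S ribosomal RNA gene", 870.0, 870.0, "PLACEHOLDER_2"),
    BlastHit("Placeholder bacterium A 16S ribosomal RNA gene", 912.0, 912.0, "PLACEHOLDER_1"),
    BlastHit("Placeholder bacterium C 16S ribosomal RNA gene", 455.0, 455.0, "PLACEHOLDER_3"),
]

if __name__ == "__main__":
    print(worksheet_summary(EXAMPLE_HITS))
```

In the real results page the list is already sorted, so the top row is the best match; the sketch merely makes explicit that "best" here means the highest maximum score.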