PCR amplification of the bacterial genes coding for nucleic acid

SAN DIEGO MESA COLLEGE
Introduction Molecular Cell Biology Laboratory (Bio210A)
Instructor: Elmar Schmid, Ph.D.
Bioinformatics & 16S-rRNA-based phylogenetic analysis
Laboratory Objectives
After completion of this lab you should:
1. have a basic understanding of the working principle of modern DNA
sequencing methods, most importantly of the Sanger method
2. have a deep understanding of biological databases and of the basic working
principles of bioinformatics using common search tools and algorithms
3. be able to submit a DNA sequence, e.g. retrieved after DNA sequencing or
from a public database, to the NIH/NCBI-hosted BLAST search engines
4. be able to do a basic interpretation of BLAST search results in the context of
bacterial identification based on submitted 16S-rRNA gene sequences
5. understand the importance of bacterial rRNA genes, especially the evolutionarily
highly conserved gene for 16S-rRNA, for phylogenetic analysis in modern
microbiology
Necessary Materials & Equipment
- Bacterial 16S-rRNA gene sequence, e.g. retrieved after 16S-rRNA PCR and
subsequent DNA sequencing (will be supplied by the instructor; see separate
lab hand-out)
- Computer workstation with internet access
- Printer
- Paper & writing materials
Introduction




In the past 20 years, the DNA sequences of thousands of genes, and even the whole
DNA content (the genome) of many life forms, have been read with the help of a
revolutionary molecular biological technique called DNA sequencing.
With the complete human genomic DNA sequence, more than 3 billion base pairs,
deciphered at the beginning of this millennium, genetics and molecular biology
hold the great promise of understanding many fundamental processes in biology,
such as embryonic development, growth, aging, cancer and the many heritable
diseases, at the molecular level.
DNA sequencing, which is the deciphering of the order of nucleotides within a
given DNA molecule, became possible with the introduction of the Sanger DNA
sequencing method into modern lab routines.
The Sanger method is the most widely applied, and by now automated, DNA
sequencing method (see Graphic 1 below)
- this elegant method uses 2′,3′-dideoxynucleoside triphosphates (ddNTPs), which
lack a 3′-hydroxyl group
- 4 different ddNTPs (ddATP, ddTTP, ddGTP & ddCTP), which are usually labeled with
different fluorescent dyes (= fluorophores), are mixed with “non-labeled” dNTPs and
added, together with DNA of unknown sequence as template, to a DNA
polymerase enzyme
- in this sequencing method, single-stranded DNA with unknown nucleotide sequence
serves as the template strand for in vitro DNA synthesis with the help of the enzyme
DNA polymerase; whenever the DNA polymerase incorporates a ddNTP instead of
a dNTP while copying the DNA template, synthesis stops and no further nucleotides are
added to the copied strand, because the ddNTP lacks the 3′-OH group
- since the incorporation of a ddNTP instead of a dNTP into the growing daughter DNA
strand is a random event, daughter DNA strands of different lengths are generated
- the fluorescently labeled DNA daughter strands of different lengths are then
separated by gel electrophoresis in long gel slabs or gel capillaries
- the method requires a synthetic oligodeoxynucleotide as primer to start DNA
synthesis (5′-end-fluorophore-labeled when the ddNTPs themselves are not labeled)
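The chain-termination principle described in the bullet points above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of a real sequencing run: the template sequence, the termination probability `dd_fraction` and the function names are made up for illustration, and the template is assumed to be written 3′→5′ so that the daughter strand reads 5′→3′ from left to right.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_fragments(template, dd_fraction=0.2, n_strands=5000, seed=0):
    """Simulate chain termination.

    Returns a dict mapping fragment length -> terminating (labeled) base.
    dd_fraction is the assumed chance that a ddNTP instead of a dNTP is
    incorporated at any given position (hypothetical value).
    """
    rng = random.Random(seed)
    # The daughter strand is complementary to the template strand.
    daughter = "".join(COMPLEMENT[base] for base in template)
    fragments = {}
    for _ in range(n_strands):
        for position, base in enumerate(daughter, start=1):
            if rng.random() < dd_fraction:
                # A ddNTP was incorporated: synthesis stops here and the
                # fragment carries the dye of its last base.
                fragments[position] = base
                break
    return fragments


def read_gel(fragments):
    """Read the labeled end bases from the shortest to the longest fragment."""
    return "".join(fragments[length] for length in sorted(fragments))


if __name__ == "__main__":
    template = "TACGGTCCTGCGACTA"  # arbitrary example template (3'->5')
    print(read_gel(sanger_fragments(template)))
```

Because every position is terminated in at least some of the many simulated strands, sorting the fragments by length and reading their labeled end bases reproduces the sequence of the daughter strand, which is the complement of the template.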
Graphic 1: DNA Sequencing: The Sanger Method
(schematic: ssDNA with unknown sequence is copied in the presence of ddATP, ddTTP,
ddCTP and ddGTP; the products are separated by electrophoresis, and the DNA
sequence of the ssDNA is deduced by reading the sequencing gel)
Graphic©E.Schmid-2006

Knowledge of the complete DNA sequence, not only of the human genome but also of
other life forms including bacteria, has great potential to pave the way for
sophisticated, improved DNA-based diagnostic tools and technologies that test for
mutations, and for direct comparison of the DNA sequences of different life forms
to retrace evolutionary change and unravel genetic relatedness.
- The human genetic map holds great promise for biology and medicine: to locate,
identify and therapeutically target the genes responsible for currently
incurable human genetic disorders, such as neurofibromatosis (NF), cystic fibrosis
(CF) and X-linked severe combined immunodeficiency (X-SCID), and for other human
malfunctions in the near future
- Knowledge of the gene sequences and even whole genomes of bacteria and
microbial pathogens can be used to find better cures for microbial diseases, to
develop new DNA-based vaccines, to accelerate bacterial detection and to
retrace genetic and evolutionary relationships among different life forms; the
latter purpose is referred to as phylogenetic analysis
However, in order to sort, handle and use the vast amount of gene and genome DNA
sequence data, biologists began to incorporate sophisticated computer tools and
mathematical algorithms into their work to analyze, interpret and predict the structure
and function of the many identified DNA sequences
Not surprisingly, the completion of the sequencing of many bacterial genomes,
e.g. E. coli, and of the human genome coincided with the advent of a new
subdiscipline of modern biology, commonly referred to as bioinformatics
Bioinformatics is the study of genetic and other biological information using
computer technology together with statistical techniques and algorithms; it means the
scientific use of computer hard- and software to retrieve, compare and analyze
biological data, most importantly DNA nucleotide sequences, protein sequences and
three-dimensional protein structures
The primary goal of computational molecular biology is to understand the meaning of
the genomic information and how this information is expressed in the form of gene
expression patterns, proteins and enzymes
With more and more completely sequenced genomes, transcriptomes and proteomes
of many biological organisms known (see Table I), more and more scientists
incorporate bioinformatics into their work to answer crucial questions
Today bioinformatics is routinely used to:
1. Compare and analyze the nucleotide and amino acid sequences of different
organisms for:
a. conserved sequences
b. homologous regions (sequence homology)
2. Predict the biological function of genes and proteins from their primary DNA
sequence
- for example: isolation and deduction of the biological function of the NF1
gene from cancer patients suffering from neurofibromatosis (NF)
(for an overview see Graphic 2 below)
Graphic 2: High sequence homology in the C-terminus of the human
NF1 protein and the Saccharomyces cerevisiae Ira protein
(schematic of the procedure: cells are isolated from an NF1 cancer patient; a cDNA
library is made and the NF1 cDNA clone is isolated and sequenced; the sequence is
submitted as a query to BLAST (blastx or blastn); the sequence homology search
matches the yeast Ira protein, whose function is that of a rasGAP protein)
Graphic©E.Schmid-2006
3. Predict the 3-dimensional structure of identified proteins and RNAs from their
linear sequences by comparison with the sequences of other known proteins
and/or RNAs whose three-dimensional structures have been successfully
resolved
4. Understand how and when genes are expressed (= gene expression analysis)

The major task of modern bioinformatics is the computer-assisted search for:
1. similar sequences (= homology); 2. functional domains and 3. structural similarities
in the exponentially growing DNA or protein data banks
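To give a feeling for what a computer-assisted search for similar sequences involves, the sketch below scores the best local alignment between two short sequences with the classic Smith-Waterman dynamic programming scheme. Note that this is not how BLAST itself works (BLAST uses faster heuristics and adds statistics); the function name and the scoring values (match = 2, mismatch = -1, gap = -2) are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared stretch "ACGT" gives 4 matches x 2 points
    print(local_alignment_score("GGACGTGG", "TTACGTTT"))
```

A high score means that the two sequences share a well-matching stretch; a database search repeats this kind of comparison, in a much faster approximate form, against every stored sequence.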

Before we can look up and familiarize ourselves with GenBank and with
BLAST, currently the most widely used bioinformatics tool, we have to understand the
essential terminology used in bioinformatics:
1. Homology
- refers to genes or proteins with similar sequences, structures and
functions, a similarity that is taken to reflect shared evolutionary ancestry
- it is the key concept that relates sequence similarity to inferences about
structure and function
2. Genome
- the entire chromosomal genetic material of an organism
3. Genomics
- the comprehensive study of whole sets of genes and their interactions
rather than single genes
- the most widely used “tools” to perform these studies are the so-called
DNA microarrays, often referred to as “gene chips”
4. Proteome
- the full complement of proteins within a cell or organism, produced by a
particular genome
5. Locus (Plural: loci)
- chromosomal location of a gene or other piece of DNA
6. Pseudogene
- a sequence of DNA similar to a gene but non-functional
- probably the remnant of a once-functional gene that accumulated
too many mutations over the course of evolution
7. Repetitive DNA
- DNA sequences of varying lengths, such as Alu, LINE or SINE sequences,
that occur in multiple copies in the genome
- it represents a major part of the genome of some biological organisms,
e.g. Homo sapiens (humans)
- repetitive sequences are usually not considered (= filtered out) by the most widely used
sequence homology search algorithms, e.g. BLAST
8. Conserved DNA sequences
- nucleotide sequence(s) which are found in highly similar versions
in a variety of different genes or in the genomes of other life forms
- sequences which did not undergo significant changes and have been
“evolutionarily conserved”, typically because changes to them would impair an
essential function
9. Annotations
- additional information, such as origin of sequence (animal, plant, etc.), key
features of sequences (start/stop codons, etc.) and references to journal
articles, which is linked to each sequence entry in certain databases
10. BLAST
- a search tool, accessible via the NIH/NCBI website, that allows identification of
homologous sequences of genes and proteins of different organisms
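The term “conserved sequence” from the list above can be made concrete with a few lines of code. The following minimal sketch assumes that the sequences have already been aligned to the same length (gaps written as “-”); the function name and the three short example sequences are hypothetical and not taken from any database.

```python
def conserved_positions(aligned_seqs):
    """Return the 0-based columns in which all aligned sequences agree."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    conserved = []
    for column in range(length):
        bases = {seq[column] for seq in aligned_seqs}
        # A column counts as conserved if exactly one base occurs in it
        # and that "base" is not a gap character.
        if len(bases) == 1 and "-" not in bases:
            conserved.append(column)
    return conserved


if __name__ == "__main__":
    example = ["ACGTAC", "ACGAAC", "ACCTAC"]  # made-up aligned sequences
    print(conserved_positions(example))
```

In the made-up example the first two and the last two columns are identical in all three sequences, while the two middle columns vary. Long runs of such identical columns across distantly related organisms are what makes a gene like the 16S-rRNA gene useful for comparison.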

Today, more and more biological data are stored, retrieved and analyzed with the help
of computer systems

Especially molecular biological data, such as:
1. completed or draft nucleotide sequences of genes
2. the nucleotide sequences of whole genomes of biological organisms
3. the sequenced or deduced amino acid sequences of protein fragments or
of complete proteins
4. the structural (= 3D) coordinates of molecules, macromolecules, protein
fragments or complete proteins
are submitted and stored with the help of computer systems in so-called
databanks

Worldwide, there are currently several dozen servers that provide access to over 300
different databanks

The currently largest and most comprehensive databanks are run, evaluated,
exchanged and updated daily by publicly funded organizations:
1. NIH/NCBI’s GenBank (U.S.A.)
2. EBI/EMBL Nucleotide Data Bank (E.U.)
3. DNA Database of Japan (DDBJ)
- the data stored in these three databanks, as well as in the databanks run by
smaller, publicly funded research organizations, are open to the public and can be
accessed via the internet free of charge

In the past years, a series of privately owned companies, e.g. Celera and Incyte, started
highly ambitious genome sequencing programs with the goal of establishing their own
proprietary data banks
- companies like Celera also developed their own sophisticated annotation and
database search programs
- these proprietary data banks can only be accessed by paying subscribers

In general, all these databanks provide the researcher with “sextant,
compass and charts” in the form of sophisticated computer algorithms and “search
engines” that enable targeted navigation through the genome maps

Before getting started with this bioinformatics lab, let’s look at the most important
biological databanks in more detail, with a major focus on genomic databases.
GenBank (☺) (NIH/NCBI, USA)
- GenBank® is the genetic sequence database hosted and updated daily by the
National Institutes of Health (NIH)
- It comprises an annotated collection of all publicly available DNA sequences,
which can be accessed free of charge via the website of the NIH-supported
NCBI (= National Center for Biotechnology Information) at the following web
address:
http://www.ncbi.nlm.nih.gov/
- As of April 2001, there are approximately 12,419,000,000 bases in 11,546,000
sequence records in this database
- GenBank is an essential part of the International Nucleotide Sequence Database
Collaboration, which further comprises the DNA Data Bank of Japan (DDBJ)
and the European Molecular Biology Laboratory (EMBL)
- All three publicly funded databases exchange their new database entries on a
daily basis
(☺) You will be accessing this database in this course!
EMBL Nucleotide Sequence Database (European Bioinformatics Institute =
EBI, Europe)
- This data bank, run by the EMBL outstation EBI, constitutes Europe's primary
resource for DNA and RNA nucleotide sequences
- The EMBL Nucleotide Sequence Database contains 14,366,182 entries
comprising 15,383,451,165 nucleotides
- The nucleotide database and other bioinformatics resources can be accessed
free of charge via the internet at:
http://www.ebi.ac.uk
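The database sizes quoted above for GenBank and for the EMBL database become easier to grasp when they are converted into an average length per sequence entry. The short calculation below uses only the figures given in the text (the EMBL figures are undated in this hand-out); the function name is chosen for illustration.

```python
def mean_record_length(total_bases, n_records):
    """Average number of bases per database record."""
    return total_bases / n_records


# Figures as quoted in the text above
genbank = mean_record_length(12_419_000_000, 11_546_000)
embl = mean_record_length(15_383_451_165, 14_366_182)

print(f"GenBank: about {genbank:.0f} bases per record")
print(f"EMBL:    about {embl:.0f} nucleotides per entry")
```

Both databases come out at roughly a thousand bases per entry, which suggests that most of the records stored at that time were single genes or gene fragments rather than whole genomes.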
DNA Database of Japan (DDBJ)
TIGR Database (= databases of “The Institute for Genomic Research (TIGR)”)
- TIGR is a non-profit research institute located in Rockville, MD (U.S.A.), which was
founded in 1992
- Its major research interest lies in the structural, functional and comparative analysis
of genomes and gene products from a wide variety of organisms including viruses,
pathogenic and non-pathogenic bacteria, archaea and eukaryotes
- The TIGR databases contain finished and unfinished DNA sequences of a diversity
of bacterial and plant genomes, such as Mycobacterium tuberculosis, Helicobacter
pylori or Arabidopsis thaliana
- In 1995, TIGR was the first research institution to completely sequence the whole
genomes of two bacteria: Haemophilus influenzae and Mycoplasma genitalium
- Scientists at TIGR were also the first to complete the genome
sequences of the archaeon Methanococcus jannaschii (1996) and the
oral disease-causing microbe Porphyromonas gingivalis
- The TIGR databases can be accessed free of charge via the internet at the
following web address:
http://www.tigr.org/tdb/
____________________________
(Student Name / Team)
____________
(Date)
Procedure
a. Have your Bio-informatics worksheet hand-out showing the 16S rRNA gene
sequence of an unknown bacterium ready and have a seat in front of the
computer workstation
- your instructor will hand this worksheet out at the beginning of this lab session
b. Use the internet-connected computer system and open the NIH/NCBI
(National Institutes of Health/National Center for Biotechnology Information)
home page by typing in the following web address:
http://www.ncbi.nlm.nih.gov
c. In the upper section, mouse-click on the ‘BLAST’ icon to access the BLAST
program
d. Scroll down to the ‘Basic BLAST’ section to choose a BLAST program; click
on the ‘nucleotide blast’ hyperlink text to access the nucleotide-nucleotide
sequence analysis section of BLAST
e. In the ‘Enter Query Sequence’ window, type in the DNA sequence you
received with your worksheet hand-out
- work in teams of two and have one team member spell out the nucleotide
sequence to the team member sitting in front of the computer workstation
- make sure that you do not make any typing errors while doing this important job
f. Go to the ‘Choose Search Set’ section and under ‘Database’, click on the
‘Others (nr, etc.)’ option to select the GenBank database for your query
g. Click on the ‘BLAST’ search icon at the bottom of this page to start the
sequence similarity search of your submitted bacterial 16S rRNA gene
sequence within the GenBank database
h. The database search will take some time, but after a couple of seconds you
should receive a result report, including an overview graphic, similar
to the nBLAST result graphic shown at the end of this hand-out
i. Analyze these results carefully and answer the questions below.
j. After you are done with your analysis (and hopefully have an idea from which
bacterium the DNA was isolated), turn in the two completely filled-in question
sheets (pages 10 & 11) as part of your weekly lab report
(Don’t forget to put your name and the date on the 2 sheets)
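Since typing errors in step e. directly affect the search result, it can help to check the typed-in sequence before pasting it into the query window. The following sketch is an optional aid and not part of the official procedure: the function names and the default sequence name are hypothetical, and the check deliberately accepts only the letters A, C, G, T and N (BLAST itself also accepts further ambiguity codes). The second function writes the sequence in FASTA format, which the BLAST query window accepts.

```python
VALID_BASES = set("ACGTN")


def clean_query(raw):
    """Upper-case the input, drop whitespace and digits, reject other characters."""
    seq = "".join(
        ch for ch in raw.upper() if not ch.isspace() and not ch.isdigit()
    )
    bad = sorted(set(seq) - VALID_BASES)
    if bad:
        raise ValueError(f"unexpected characters in sequence: {bad}")
    return seq


def to_fasta(seq, name="unknown_16S", width=60):
    """Format a sequence as FASTA: a '>' header line, then lines of `width` bases."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines)


if __name__ == "__main__":
    typed = "acgt ACGT\n1 nn"  # made-up input with spaces, a digit and a line break
    print(to_fasta(clean_query(typed)))
```

A letter such as X or a stray punctuation mark raises an error, so the team can re-check that part of the sequence against the worksheet before starting the search.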
_____________
(Date)
______________________
(Student Name)
What was the ID number of the DNA sequence you submitted to BLAST?
DNA Sequence ID #: _____
How many BLAST hits (if any) did you get with your 16S rRNA query
sequence?
___________ hits
Scroll further down to the ‘Sequences producing significant alignments’ section of
the BLAST results page and look up the “hit list”, i.e. the list showing the
microorganisms with the highest sequence homology; the bacterium named at the top of
this list is the one with the best match to your submitted sequence (= query sequence)
Write down the following pieces of information about your best-matching bacterium,
i.e. the bacterium at the top of the hit list
I. What is the name of the top-scoring bacterium?
_____________________________________
II. Which gene is matched, i.e. has the highest sequence homology with your
submitted 16S rRNA gene (query) sequence?
_______________ gene
III. What are the maximum score and the total score of the best-matching
bacterium?
Max Score = _______
Total Score = _______
IV. What is the GenBank accession number of the highest scoring
bacterium?
Accession Number: _______________
Now click on the ‘Accession’ number hyperlink of your best matching bacterium and
retrieve further information about this microorganism.
V. Who deposited the DNA sequence with which your submitted DNA
sequence has the highest sequence homology? And when?
Depositor(s): ___________________________________
Year: ___________________________________
VI. What else can you say about the bacterium with which your DNA sequence has the
highest DNA sequence homology? Try to retrieve further information
about it, e.g. its origin, the source from which the bacterium was isolated,
where, when, etc.
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
VII. Now, what can you speculate about the nature of your unknown
bacterium from which you isolated the DNA and did the 16S rRNA gene
analysis?
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
__________________________________________________________________
nBLAST Result of Nucleotide Sequence Homology Search
with the Thermus thermophilus (Tt) SSB gene sequence
(overview graphic: the query sequence (= Tt-SSB gene) is shown at the top, followed by
the 1st and 2nd best matches and, below them, the low sequence homology matches)