A Six-Week Project in Molecular Biology and Bioinformatics

This article reprinted from:
Robinson, D. L., J. M. Lau, S. Porter, B. S. Wiseman, and M. Woodrow. 2009. Modular
cloning and sequencing: A six-week project in molecular biology and bioinformatics.
Pages 111-183, in Tested Studies for Laboratory Teaching, Volume 30 (K.L. Clase,
Editor). Proceedings of the 30th Workshop/Conference of the Association for Biology
Laboratory Education (ABLE), 403 pages.
Compilation copyright © 2009 by the Association for Biology Laboratory Education (ABLE)
ISBN 1-890444-12-X
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise,
without the prior written permission of the copyright owner. Use solely at one’s own institution with no
intent for profit is excluded from the preceding copyright restriction, unless otherwise noted on the
copyright notice of the individual chapter in this volume. Proper credit to this publication must be
included in your laboratory outline for each use; a sample citation is given above. Upon obtaining
permission or with the “sole use at one’s own institution” exclusion, ABLE strongly encourages
individuals to use the exercises in this proceedings volume in their teaching program.
Although the laboratory exercises in this proceedings volume have been tested and due consideration has
been given to safety, individuals performing these exercises must assume all responsibilities for risk. The
Association for Biology Laboratory Education (ABLE) disclaims any liability with regards to safety in
connection with the use of the exercises in this volume.
The focus of ABLE is to improve the undergraduate
biology laboratory experience by promoting the
development and dissemination of interesting,
innovative, and reliable laboratory exercises.
Visit ABLE on the Web at:
http://www.ableweb.org
Modular Cloning and Sequencing:
A Six-Week Project in Molecular Biology
and Bioinformatics
Dave L. Robinson (1), Joann M. Lau (1), Sandra Porter (2),
Bryony S. Wiseman (3), and Melissa Woodrow (3)
(1) Department of Biology, Bellarmine University,
2001 Newburg Road, Louisville, KY 40205
drobinson@bellarmine.edu
jlau@bellarmine.edu
(2) Geospiza Inc., 100 West Harrison,
North Tower, Suite #330, Seattle, WA 98119
sandy@geospiza.com
(3) Biotechnology Explorer Program, Bio-Rad Laboratories,
2000 Alfred Nobel Drive, Hercules, CA 94547
Bryony_Ruegg@bio-rad.com
Biography
Dave L. Robinson received his B.S. and M.S. in plant science from the University of Arizona, and
his Ph.D. in plant physiology from the University of Minnesota. Now an Associate Professor of
Biology at Bellarmine University in Louisville, KY, he has taught Principles of Biology, Plant
Diversity, Molecular Biology, Environmental Science, and Genetics, as well as seminar courses in
ethnobotany. He has served as Biology Department Chair as well as principal investigator on a 3-year grant from NIH-NCRR. His research interests are in the physiology and genetics of weedy
plants like ragweed, dandelion, and white snakeroot.
Joann M. Lau received a Ph.D. from the University of Illinois Urbana-Champaign and a B.A. from
Bellarmine University. While at UIUC, she was a USDA Agriculture Genome Sciences and Public
Policy Fellow and also received the Colgate-Palmolive Graduate Fellowship and the Eugene S.
Boerner Graduate Fellowship. She has taught Molecular Biology, Drugs and the Human Body,
Association for Biology Laboratory Education (ABLE) 2008 Proceedings, Vol. 30:111-183
Introduction to Forensics Science, Modern Genetics, Introduction of Life Sciences, Principles of
Biology labs, Cell Biology labs, and Molecular Biology labs at Bellarmine University. Her research
currently involves developing new molecular biology exercises for the classroom, studying the
evolution of triple repeat diseases in non-human primates, the effects of Reishi mushroom on lung
cancer cell proliferation, and the expression of allergenic-related genes in ragweed.
Dr. Sandra Porter, the education director at Geospiza, Inc., has been a long-time participant in
biotechnology and bioinformatics education. For several years, Dr. Porter ran the biotechnology
education program at Seattle Central Community College, and acted as the regional director for Bio-Link, an Advanced Technology Education center funded by the National Science Foundation. In
2001, Dr. Porter joined Geospiza, Inc., to work on an NSF-funded project to develop educational
materials that use bioinformatics resources in the same way that they are used by biologists, as tools
for understanding biology and doing experiments. During her years at Geospiza, Dr. Porter has
written and published two laboratory manuals, one CD, and a peer-reviewed study examining the
efficacy of using molecular viewing tools for teaching DNA structure, in addition to research papers
on genetic variation in the genes for clotting factors. She has recently completed an NIH-funded
project on applying the use of molecular viewing programs in student activities that explore alcohol
metabolism and human polymorphisms. Currently, she is writing a textbook on bioinformatics for
biology students. She also writes a blog called “Discovering Biology in a Digital World”
(www.scienceblogs.com/digitalbio).
Bryony Wiseman graduated in Molecular Biology from Glasgow University, Scotland. She won a
four year Imperial Cancer Research Fund studentship and received a Ph.D. in Biochemistry in 1999
from University College, London for her work on Ras and Raf signaling. She was awarded a U.S.
Dept. of Defense Breast Cancer Research Program post-doctoral fellowship to work on the
extracellular environment and mammary gland development at UCSF, San Francisco. She changed
focus in 2002 when she joined Bio-Rad Laboratories’ Biotechnology Explorer Program. She is
currently a Staff Scientist at Bio-Rad and develops tools and kits to help teachers teach
biotechnology to high school and undergraduate students.
Contents
Introduction to cloning GAPDH gene (for instructor):
Bioinformatic Analysis of GAPDH sequences (for student):
Background Reading
Bioinformatic Exercises
Notes for Instructor:
Materials
Acknowledgement
Literature Cited
Appendix
Introduction
This project involves isolating (cloning) and analyzing a major portion of the gene for the enzyme
Glyceraldehyde 3-phosphate dehydrogenase (GAPDH) from an uncharacterized plant species. This
gene is considered a housekeeping gene (a continually-transcribed gene whose product is involved
in basic cell function). It codes for an enzyme that catalyzes an important step of glycolysis, a stage
of respiration that occurs in all living cells. Because of the central importance of GAPDH, this gene
occurs in all plants, as well as in all other organisms.
GAPDH is a crucial enzyme for all animals, protists, fungi, bacteria and plants. Therefore, there
are many opportunities to draw connections between the molecular aspects of GAPDH and its
biomedical and evolutionary significance (Figge et al., 1999). For instance, the human GAPDH
gene has been found to be highly expressed in 21 different types of cancer and may play an
important role in future cancer therapies (Altenberg and Greulich, 2005). Others have reported that
GAPDH is a multifaceted protein that may be involved in regulating transcription and programmed
cell death, and may have a role in age-related diseases like Alzheimer’s and Huntington’s Disease
(Kim and Dang, 2005; Sirover, 1999). GAPDH is also thought to have roles in DNA replication and
repair, cytoskeletal organization, and phosphotransferase activity (Tatton et al., 2000).
Aerobic respiration is composed of four basic stages: glycolysis, formation of acetyl coenzyme A,
the citric acid cycle, and the combined processes of electron transport and chemiosmosis. It is this
first stage, glycolysis, that is the most uniform and unwavering in its occurrence in the biological
world. Cells undergoing anaerobic respiration due to lack of oxygen, like fermenting yeast or
vigorously exercised muscles, for instance, only carry out glycolysis. Prokaryotic organisms
(bacteria) that do not contain mitochondria (where the latter three stages occur in eukaryotes) still
carry out glycolysis.
Glyceraldehyde 3-Phosphate (GAP) is invaluable to the second half of glycolysis (Dennis and
Blakely, 2000). One of the unique features of GAP is its ability to have a phosphate added to it
without having to consume an ATP. The product of the GAPDH reaction, 1,3-Bisphosphoglycerate
(BPG) is such a high-energy molecule that it is eventually acted upon by the other enzymes of
glycolysis to yield two ATP molecules. After the loss of its two phosphates (for making ATP from
ADP), and other structural alterations, that three-carbon molecule becomes pyruvate which can be
actively transported into the mitochondria for the other stages of aerobic respiration.
GAPDH Reaction: Glyceraldehyde 3-Phosphate + NAD+ + Pi → 1,3-Bisphosphoglycerate + NADH + H+
Another important feature of the GAPDH reaction is the generation of reducing power in the
form of NADH. This coenzyme provides reducing power to hundreds of different enzymes involved
in catalyzing oxidation-reduction reactions in the cell.
The DNA sequence for the GAPDH gene that results from this exercise is important for several
reasons. The basic structure of the GAPDH gene can be examined by looking at the DNA sequence.
For instance, the location of specific introns and exons can be predicted using readily-available
software.
The amino acid sequence of the GAPDH gene product can also be predicted. Evolutionary
differences between organisms can be studied by comparing plant GAPDH genes isolated by
researchers working with other species (Olsen et al., 1975). The biochemical characteristics (e.g. the
active site) can also be deduced by aligning the sequences from numerous species and looking for
areas showing high levels of consensus.
When the DNA sequence for a housekeeping gene like GAPDH is elucidated in a relatively
unstudied species, it provides novel information about that plant that can be useful to other biologists.
Whereas past researchers have focused more on studying the genomes of model species (like
fruit flies, yeast, mice, Arabidopsis), information about lesser-known species can be quite valuable in
filling in the evolutionary gaps. Our goal is to see these unique sequences published in GenBank
(Benson et al., 2007) with teachers and students as co-authors (see Accession number DQ075672).
The steps* to be taken in this project are as follows:
Week 1:
1. Identify the plant species to be studied
2. Extract genomic DNA
3. Initial amplification of the GAPDH gene using the polymerase chain reaction (PCR)
Week 2:
4. Exonuclease treatment
5. Nested PCR
Week 3:
6. Clean PCR product
7. Blunt PCR product
8. Ligate (insert) the GAPDH gene fragment into a plasmid vector
9. Prepare competent cells
10. Transform the plasmid into bacteria via heat-shock transformation
Week 4:
11. Select and multiply the bacteria containing the recombinant plasmid**
12. Isolate the plasmid from the bacteria
13. Confirm the presence of the plant gene in the plasmid by restriction enzyme digest
14. Prepare plasmid for DNA sequencing and obtain the sequence of the GAPDH gene
Weeks 5 and 6:
15. Perform bioinformatic analysis of the plant GAPDH gene
* Assumes 3-hour lab periods
** Students select colonies at least one day before lab
Workflow of complete activity:
Stage 1: DNA Extraction
This project is an opportunity to perform novel research - to clone and sequence a gene that has
not yet been categorized - and to add that information to the body of scientific knowledge about
GAPDH. The first step in this exercise is to choose an interesting plant species to work with. Some
model species (for example Arabidopsis thaliana or Chlamydomonas) or crop plants (like rice and
wheat) have already had their genomes sequenced but you may choose to reproduce and confirm this
data. Alternatively, you might choose to select a species that is less studied. There are over 250,000
plant species known to exist on the planet providing plenty of options to work with. Also, you could
choose a variety or cultivar (within a species) that no one has examined yet.
Background
In order to clone a gene from an organism, DNA must first be isolated from that organism. This
genomic DNA is isolated from one or two plants using column chromatography. Plant material is
weighed, and the material is ground in lysis buffer with high salt and protein inhibitors using a
micropestle. The solid plant material is removed by centrifugation, ethanol is added to the lysate and
the lysate is applied to the column. The ethanol and salt encourage DNA to bind to the silica in the
chromatography column. The column is then washed three times and the DNA is eluted using sterile
water at 70°C.
For PCR to be successful the DNA extracted needs to be relatively intact. The best sources for
DNA extraction are green leaves, but fruit, roots, or germinating seeds should also suffice. It is
better to use tissue that is relatively young and still growing, as the ratio of nucleus to cytoplasm will
be more favorable, the cell walls will be thinner, and the amount of potentially harmful secondary
products will be less. There are two features of plants that make DNA extraction different from
animals. First, plants have a tough cell wall made of cellulose that has to be penetrated. This is easy
to do with vigorous grinding using a mortar and pestle. Secondly, a major part of every plant cell is
a vacuole that contains acids, destructive enzymes (including nucleases) and unique secondary
compounds (chemicals produced from pathways that are not part of primary metabolism) that might
potentially damage DNA.
Since plants are immobile and cannot easily escape from herbivores and pathogens they produce
a myriad of different secondary compounds to defend themselves. Animals, and bacteria, don’t
typically produce as many interfering secondary compounds as plants do. Although some of these
phytochemicals can be toxic others have proven very useful. We wouldn’t have the herbs, spices,
natural flavorings and numerous medicines we have if it weren’t for these secondary metabolites.
Vacuoles are the ‘garbage-dump’ of the cell as plants cannot excrete wastes the way animals do.
Since the vacuole makes up 90% of the cell volume, that is a lot of waste. With the pH in the vacuole
being 5.0-5.5, and there being lots of harmful chemicals (as well as nucleases) in it, this organelle
can be problematic when doing DNA extractions because it is impossible to break open plant cells
and nuclei without also breaking the vacuole. To minimize contaminants from the vacuolar
contents, salts and other inhibitors need to be added to the lysis buffer.
Step 1- DNA Extraction Protocol
Procedure can be obtained at explorer.bio-rad.com
Stage 2: GAPDH PCR
The overall purpose of this experiment is to clone a portion of the Glyceraldehyde 3-phosphate
dehydrogenase (GAPDH) gene. Because it is a vital metabolic enzyme involved in one of the most
important of biological processes (glycolysis) the GAPDH protein is highly conserved between
organisms, especially in vital domains of the enzyme such as the active site. However, this does not
mean that the gene sequences are identical in different organisms. Much of a gene’s DNA sequence
does not code for protein. This intronic DNA is not subject to the same selective pressures as DNA
that codes for protein. In addition, within exons (gene sections that encode for proteins) there is
degeneracy of the genetic code such that different DNA triplet codons encode the same amino acid.
Also, regions of the enzyme that are less vital to function (non-active sites, for instance) might not
experience the same degree of selective pressure as more important regions. Although there is still
conservation of protein sequence in these regions, it may not be as stringent as in others.
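
The degeneracy of the genetic code described above can be illustrated with a short Python sketch. The two DNA fragments and the abbreviated codon table are hypothetical, chosen only for illustration, and are not taken from any real GAPDH sequence. The point is that two sequences can differ at several nucleotide positions yet encode the same amino acids.

```python
# Abbreviated codon table: only the codons needed for this illustration.
CODON_TABLE = {
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "AAA": "K", "AAG": "K",                          # lysine
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",  # valine
}


def translate(dna):
    """Translate a coding sequence codon by codon (no stop-codon handling)."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_differences(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Hypothetical exon fragments from two species (assumed for illustration).
species_a = "GGTGCTAAAGTT"
species_b = "GGCGCAAAGGTC"

print(translate(species_a), translate(species_b))
print("nucleotide differences:", count_differences(species_a, species_b))
```

In this made-up case every difference falls in the third codon position, so the DNA sequences diverge while the protein sequence is unchanged.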
Background
To clone a known gene from an uncharacterized organism, PCR primers (synthetic single-stranded
oligonucleotides 17-25 bases long) must be designed that are complementary to conserved
regions of the GAPDH gene. Even conserved regions are not identical between organisms, however.
A best guess of the gene sequence is made using a comparison alignment of GAPDH gene sequences from
different related organisms, with the understanding that the primers will not be an exact match to the
sequence and may amplify non-specific sections of DNA in addition to the target sequence. A
second set of primers has been designed (interior to the first set of primers) and is used to amplify the
PCR products from the first round of PCR. This is called nested PCR and is based on the extremely
slim chance of non-specific amplified DNA also encoding these interior sequences. In other words,
if the wrong fragment is amplified with the first primers, the probability is quite low that the wrong
fragment will be amplified during the second round of PCR. As a result, the PCR products
generated from nested PCR are very specific. Since the nested PCR primers are interior within the
first fragment, the PCR products generated during the second round of PCR are shorter than those
from the first round. See diagram below for an illustration of nested PCR.
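
The logic of nested PCR can be sketched in a few lines of Python. This is a simplified model under stated assumptions: primers must match exactly (real primers tolerate mismatches), and the template, the off-target fragment, and all four primer sequences are invented for illustration. The function name `amplify` is likewise hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def amplify(template, forward, reverse):
    """Return the top-strand product bounded by two exact-match primers, or None."""
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the top strand at its reverse complement.
    site = template.find(reverse_complement(reverse), start + len(forward))
    if site == -1:
        return None
    return template[start:site + len(reverse)]


# Hypothetical primer sets (Set 1 = outer, Set 2 = nested).
SET1 = ("ATGGCA", "CGCTAA")
SET2 = ("CCTTGA", "GGAATC")

# Hypothetical target: the Set 2 sites lie inside the Set 1 sites.
target = "GGATGGCATTTCCTTGAACGTACGTGATTCCAAATTAGCGCC"
# Hypothetical off-target region that happens to match Set 1 only.
off_target = "ATGGCAAAAAAAAATTAGCG"

initial = amplify(target, *SET1)
nested = amplify(initial, *SET2)
wrong_initial = amplify(off_target, *SET1)
wrong_nested = amplify(wrong_initial, *SET2)

print(len(initial), len(nested))  # nested product is shorter
print(wrong_nested)               # off-target product is not re-amplified
```

The off-target fragment survives the first round but contains no binding sites for the second primer set, which is why the nested product is expected to be both shorter and more specific.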
The strategy for this experiment uses nested PCR to amplify portions of the GAPDH gene we
want to study. In the initial PCR reaction, a set of degenerate primers is used. These are primers
that contain more than one DNA nucleotide base at specific positions increasing the likelihood that
the primer will bind. Then in the nested PCR reaction a more specific set of primers will amplify
GAPDH. The initial PCR reaction uses Primer Set 1, while the second round of PCR, the nested
PCR reaction, uses Primer Set 2. It is very important not to reverse the order of the primers or to
mix the two primer sets together.
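
A degenerate primer is really a pool of related oligonucleotides. The sketch below expands a primer written with the standard IUPAC ambiguity codes into every sequence it represents. The example primer is hypothetical and is not one of the primers supplied for this project.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(primer):
    """List every specific sequence represented by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]


def degeneracy(primer):
    """Number of distinct sequences in the primer pool."""
    total = 1
    for base in primer:
        total *= len(IUPAC[base])
    return total


primer = "GAYTTYGTNGG"  # hypothetical degenerate primer
variants = expand(primer)
print(degeneracy(primer), "sequences, e.g.", variants[:3])
```

Higher degeneracy raises the chance that some member of the pool matches an unknown template, at the cost of diluting each individual sequence and increasing non-specific binding.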
Arabidopsis genomic DNA has been included as a control for these PCR reactions. In addition, a
plasmid encoding the initial PCR product amplified from Arabidopsis genomic DNA acts as a
second control.
As each PCR reaction takes approximately 3-4 hours to run, it is more practical to run the PCR
reactions on separate days. Since the reagents used in these experiments function best
when prepared fresh, it is highly recommended that the reagents be prepared just prior to setting up
the PCR reactions.
Step 2A- Initial PCR Reaction Protocol
Procedure can be obtained at explorer.bio-rad.com
Stage 2B: Nested GAPDH PCR
Background
In this next experiment, PCR products generated in the previous stage will be further amplified
(i.e., will serve as the template) in a second round of PCR. However, before performing the nested PCR,
the primers that were not incorporated into the PCR product must be removed so that they do not
amplify target DNA in the second round of PCR. To do this an enzyme that specifically digests
single stranded DNA, Exonuclease I, is added to the PCR reactions. In nature, this enzyme is
involved with proofreading and editing newly-synthesized DNA. Exonuclease I needs to be
inactivated before it is introduced into fresh PCR reactions containing Primer Set 2 to prevent it
digesting those new primers.
Following Exonuclease I treatment, PCR products generated in the first round of PCR are
amplified using the nested primers. Plasmid DNA will also be amplified in this step to serve as a
positive control for PCR. A no-template negative control is also run.
Step 2B- Nested PCR (Second Round) Protocol
Procedure can be obtained at explorer.bio-rad.com
Stage 3: Analysis of PCR Products by Agarose Gel Electrophoresis
Background
To assess PCR success the products of both the initial and nested PCR reactions are analyzed by
agarose gel electrophoresis. The procedure for this can be found on-line at www.bio-rad.com
Analyzing results from the PCR
Students examine and interpret their gels. An example is given below. A+ = Arabidopsis
genomic, positive control; P+ = plasmid, positive control; ─ = water, negative control; Plants 1-6. I
= initial PCR; N = nested PCR.
The portion of the GAPDH gene that has been targeted for PCR varies in size between plant
species. The expected size of the fragment from the first round of initial PCR should be between 0.8
and 2.5 kb. The size of the fragment from the second round of nested PCR is expected to be slightly
smaller than the PCR product from the initial round of PCR. It is possible that some plants may
amplify doublets - two DNA fragments of similar sizes. This is probably due to amplification of two
GAPDH genes that are very homologous (genes that share similar structures and functions that were
separated by an ancient duplication event). Students construct a table of their results, addressing
issues like number and relative intensity of the bands.
We recommend that a laboratory class of 16 students attempt to work with only one or two plant
species. Cloning the same gene multiple times will provide significant depth of coverage. This will
help resolve any ambiguities when the gene sequence is prepared for eventual publication in
GenBank. Remember, the ultimate goal of this laboratory is to provide new data for the scientific
community at large, thus it is vital that the data be as correct as possible.
It is recommended that the plant species chosen for this project be the one that generates the
cleanest PCR product (fewest background bands) with strong band intensity of an appropriate size.
It is acceptable to clone doublets since each plasmid is expected to ligate a single DNA fragment.
Be aware that this may mean that two different gene sequences are obtained, however.
Stage 4: Purification of PCR Products
The next step, after generating DNA fragments, is to find a way to maintain and sequence these
products. This is done by ligating (inserting) the fragments into a plasmid vector (small circular
pieces of double-stranded DNA found naturally in bacteria) that can be propagated in bacteria. To
increase the success of ligation, it is necessary to remove unincorporated primers, nucleotides and
enzymes from the PCR reaction. This is done by using size-exclusion column chromatography.
In this method, small molecules (like proteins, primers and nucleotides) get trapped inside the
chromatography beads, while large molecules (like DNA fragments) are too large to enter the beads
and thus pass through the column into the micro-centrifuge tube.
Background
Following PCR, the amplified product needs to be purified using spin columns supplied in this
kit. The purpose of the purification step is to remove unincorporated dNTPs, Taq Polymerase,
primers and small primer-dimers so that the DNA can be successfully digested with a restriction
enzyme and ligated to a vector. This procedure uses small spin-columns that fit into micro-centrifuge tubes. These columns contain a matrix composed of minuscule beads having numerous
microscopic pores. These porous beads significantly increase the surface area of the matrix. Any
solutes that are added to the matrix will diffuse in and out of the pores very readily, but only if the
solutes are small enough to enter the spaces. The pores in this matrix are
designed to be large enough to admit and retain smaller molecules (like dNTPs and primer-dimers), but too
small to admit larger molecules (like PCR product). In this scenario, the PCR product that we
want to clean will not be detained by the porous matrix because it is hundreds of DNA bases in
length. When the PCR product is applied to the top of the columns the molecules that we want to
exclude (like dNTPs, Taq Polymerase, primers) will tend to be captured by the microscopic pores
and the larger molecules (our DNA product) will be pushed through the matrix much more readily.
DNA strands between the sizes of 32 bp and 200 bp should be eliminated, thus yielding a very pure
GAPDH PCR product.
Without this cleaning step we could be less successful in the next steps of the cloning process.
The opportunity also exists here to run gel electrophoresis of samples of PCR product before and
after cleaning to demonstrate the efficacy of the spin-column cleaning.
At this step it is helpful to know the DNA concentrations of both the clean, digested PCR product
and the vector. This can be determined using a fluorometer, spectrophotometer or by using a DNA-dye assay. These concentration values are used in determining the amounts of DNA used in the
subsequent ligation reaction.
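
One common way to turn measured concentrations into ligation amounts is to fix an insert:vector molar ratio and scale by fragment length. The sketch below shows that arithmetic only. The 3:1 ratio, the 3,000 bp vector, the 1,000 bp insert, the 50 ng of vector, and the 25 ng/µL concentration are all assumed values for illustration; the kit protocol should be followed for the actual amounts.

```python
def insert_mass_ng(vector_ng, vector_bp, insert_bp, molar_ratio):
    """Mass of insert (ng) that gives the requested insert:vector molar ratio.

    Equal masses of a shorter fragment contain more molecules, so the insert
    mass is scaled by the length ratio insert_bp / vector_bp.
    """
    return vector_ng * (insert_bp / vector_bp) * molar_ratio


def volume_ul(mass_ng, concentration_ng_per_ul):
    """Volume (µL) of a DNA solution that contains the requested mass."""
    return mass_ng / concentration_ng_per_ul


# Assumed example values (not taken from the kit protocol).
needed_ng = insert_mass_ng(vector_ng=50, vector_bp=3000, insert_bp=1000, molar_ratio=3)
needed_ul = volume_ul(needed_ng, concentration_ng_per_ul=25)

print(f"insert needed: {needed_ng:.1f} ng = {needed_ul:.1f} µL")
```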
Step 4- Purification of PCR Product Protocol
Procedure can be obtained at explorer.bio-rad.com
Stage 5: Cloning - Ligation and Transformation
Background - Ligation
In this stage the PCR product will be inserted (ligated) into a plasmid vector. The plasmid is
supplied and has already been opened-up to receive the fragment. However, prior to ligating the
fragment into the plasmid the PCR fragment must first be treated to remove the single adenosine
nucleotide that is left on the 3´ ends of the PCR fragment by Taq DNA polymerase. This is
performed by a proofreading DNA polymerase (an enzyme with a 3´ proofreading exonuclease
domain that allows the polymerase to remove mistakes in the DNA strands). This polymerase
functions at 70°C, but not at lower temperatures, so it is not necessary to inactivate this enzyme after
use.
Once blunted, the PCR fragment is combined with the plasmid. T4 DNA ligase (an enzyme
that catalyzes the formation of phosphodiester bonds between the 5´-phosphorylated PCR fragment
and the 3´-hydroxylated blunt plasmid) is added and the ligation reaction is completed in 5-10
minutes.
Background - Transformation
During ligation many different products are produced. In addition to the desired ligation product
where the PCR fragment is inserted into the plasmid vector, the vector may re-ligate, the PCR
product may ligate with itself, etc. Relatively few of the DNA molecules formed during ligation are
the desired combination of insert + plasmid vector. To separate the desired plasmid from other
ligation products, and also to have a way to propagate the plasmid, the ligation reaction is
transformed into bacteria. Plasmid vectors are natural bacterial plasmids that have been genetically
modified to make them useful to molecular biology researchers. In order to get a plasmid into
bacteria, the bacteria must be made competent. To make bacteria competent they must be actively
growing, ice cold and suspended in transformation buffer that makes them porous and more likely to
allow entry by plasmids.
In this stage, bacteria are grown in culture media so that they are actively growing. Then they are
pelleted, cooled and resuspended in transformation buffer two times to ensure they are competent. It
is vital to keep bacteria on ice at all times. Bacteria are then mixed with the ligation reaction and
plated on warm LB Ampicillin agar plates that will only permit bacteria expressing ampicillin-resistance genes (encoded by the pJet1.2 plasmid) to grow. These plates also contain isopropyl β-D-1-thiogalactopyranoside (IPTG) which induces expression of the ampicillin-resistance gene. Plates
are then incubated at 37°C overnight. To confirm that cells were made competent by this procedure,
a control plasmid is also transformed.
Step 5- Preparation of Competent Cells, Ligation and Transformation of PCR Product into
Plasmid Protocol
Procedure can be obtained at explorer.bio-rad.com
Analysis of results of ligation and transformation
Students count the number of bacterial colonies growing on both their control and transformation
plates.
Stage 6: Isolation of Plasmid DNA
Background
It is necessary to analyze the plasmids that have been successfully transformed to verify that they
have the PCR fragment inserted. To do this, a sufficient amount of plasmid DNA is obtained by
growing a small culture of bacteria, purifying the plasmid from the bacteria, and performing
restriction enzyme digests. Restriction enzymes cut double-stranded DNA at specific recognition
sequences on the plasmids. The plasmid used to ligate the PCR products is pJet1.2 which
contains Bgl II restriction enzyme recognition sites on both sides of the insertion region (figure
below). Thus, once the plasmid DNA has been isolated, a restriction digestion reaction is performed
to determine the size of the insert.
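
The expected band sizes from such a digest can be predicted computationally. The sketch below locates Bgl II sites (recognition sequence AGATCT, cut after the first A) in a circular sequence and reports fragment lengths. The 200 bp "plasmid" is a toy sequence invented for illustration, not the pJet1.2 sequence, and for simplicity the search ignores a site that would span the point where the circular sequence is linearized.

```python
BGLII_SITE = "AGATCT"  # Bgl II cuts A^GATCT


def cut_positions(seq, site=BGLII_SITE, offset=1):
    """Positions (0-based, between bases) where the enzyme cuts the top strand."""
    cuts = []
    index = seq.find(site)
    while index != -1:
        cuts.append(index + offset)
        index = seq.find(site, index + 1)
    return cuts


def circular_fragments(seq_length, cuts):
    """Fragment sizes from cutting a circular molecule, largest first."""
    if not cuts:
        return [seq_length]  # an uncut circle stays one molecule
    cuts = sorted(cuts)
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    sizes.append(seq_length - cuts[-1] + cuts[0])  # fragment spanning the origin
    return sorted(sizes, reverse=True)


# Toy recombinant plasmid: backbone, site, 40-base "insert", site, backbone.
plasmid = "C" * 100 + BGLII_SITE + "G" * 40 + BGLII_SITE + "C" * 48

fragments = circular_fragments(len(plasmid), cut_positions(plasmid))
print(fragments)  # backbone fragment first, then the insert-containing fragment
```

The smaller fragment is the insert plus the short stretch of flanking site sequence, which is why the released band runs slightly larger than the PCR product alone. An insert that itself contains a Bgl II site would yield additional bands.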
Step 6- Purify and Analyze Plasmid Mini-preps Protocol
Procedure can be obtained at explorer.bio-rad.com
Analyzing results from the restriction digest
Students are expected to interpret their gels and prepare samples for sequencing. An example of
a gel is given below. Each lane represents a different clone that was restriction digested.
Stage 7: Set up Sequencing Reactions
To study this newly-cloned GAPDH gene the recombinant plasmids need to be sequenced. Like
PCR, sequencing reactions rely on the basic principles of DNA replication and as such require
primers to initiate the replication. However, sequencing is performed in just one direction and so
instead of a primer pair, sequencing makes use of single oligonucleotides. Each sequencing reaction
proceeds in a single direction, so two sequencing reactions are set up for each plasmid - one forward
and one in reverse.
In this lab, plasmid DNA will be combined with sequencing primers (primers that anneal to the
target DNA at a known location on a specific vector DNA strand) and then mailed to a sequencing
facility which will perform the sequencing reactions and then send the DNA sequences back for
analysis.
Background
The technique for determining the exact order of As, Ts, Cs and Gs in cloned DNA is the
Dideoxy Method. This method, also called the Sanger Method, is named for Dr. Fred Sanger of
Cambridge, England, who invented it in the mid-1970s (Sanger and Coulson, 1975). In this
approach, the plasmid clones are heat denatured (to separate the complementary DNA strands with
high temperature), and used as template to synthesize new strands of DNA. To achieve this, the
template is incubated with commercially available DNA Polymerase, sequencing primers,
deoxynucleotide triphosphates (dATP, dTTP, dCTP, dGTP), and relatively small amounts of
dideoxynucleotide triphosphates (ddATP, ddTTP, ddCTP, ddGTP). The difference between the
deoxy- form and the dideoxy- form of nucleotide triphosphate is that the dideoxy- form lacks the -OH group on the 3´
carbon of the deoxyribose. This hydroxyl group is necessary for normal DNA synthesis, so
if a growing chain of DNA happens to incorporate a ddNTP instead of a dNTP, synthesis of that chain
stops. This termination event occurs rarely enough that all possible lengths of DNA
get synthesized during the process. For instance, a sequencing reaction that provides the DNA
sequence for 800 bases would essentially involve synthesizing 800 different strands of DNA,
covering the entire range of possible lengths. These different lengths of DNA are resolved by
electrophoresis (frequently capillary gel electrophoresis) and visualized. A common visualization
method is to ‘end label’ each of the four types of dideoxynucleotide triphosphate with a different
fluorescent dye that can be distinguished with a digital camera during electrophoresis. DNA
sequence output is based on the fact that longer lengths of DNA move more slowly during
electrophoresis than shorter lengths, and that the digital camera can detect the color of the
fluorescent dye that labels each of the bands. Since the specific color of the dye attached to each of
the different ddNTPs is known, and since that specific ddNTP will end-label the growing DNA
strand only where its complement occurs on the plasmid template, the task of correlating the order of
colors with a specific sequence of DNA is relatively straightforward.
This method detects and records the dye fluorescence and shows the output as fluorescent peaks
on a chromatogram.
The primers used for DNA sequencing are different from the primers used to amplify the
GAPDH gene via PCR. Sequencing primers are designed to complement the DNA sequence of the
cloning vector, rather than the insert DNA. If the primers used for PCR were also used for
sequencing then part of the clone’s sequence would be missing because sequencing starts about 20 to 50 bases away from the primer itself. This is a function of the size of DNA Polymerase. Most
commercially-available cloning vectors are designed to have sites that are relatively far from the
multi-cloning region and that will bind to widely-available sequencing primers. These universal
sequencing primers allow researchers to work with different cloning vectors at the same time.
However, the current method used in DNA sequencing generates only ~500 bases of usable sequence,
whereas the GAPDH gene in some species is much larger. Therefore, internal sequencing primers
are used to primer walk (primers designed to obtain a contiguous sequence from an internal region
of the gene of interest). In our case, double-stranded primer walking (providing contiguous
sequence information from both DNA strands) will be performed.
The U.S. Department of Energy’s Joint Genome Institute in Walnut Creek, CA has a Sequencing
Training Program that offers free DNA sequencing for classroom projects such as this. More
information is available at http://www.jgi.doe.gov/education/stp.html
Step 7- Set up Sequencing Reactions Protocol
Procedure can be obtained at explorer.bio-rad.com
Sequencing Reaction Plan
Plate Barcode Identifier____________________
Potential file names after conversion from 96-well plate to 384-well plate

96-well row (384-well rows):
A (A,B)    B (C,D)    C (E,F)    D (G,H)    E (I,J)    F (K,L)    G (M,N)    H (O,P)

96-well column (384-well columns):
1 (1,2)    2 (3,4)    3 (5,6)    4 (7,8)    5 (9,10)    6 (11,12)
7 (13,14)    8 (15,16)    9 (17,18)    10 (19,20)    11 (21,22)    12 (23,24)
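The row and column pairings in the plan can be looked up with a few lines of code. The sketch below assumes that each 96-well position expands to a 2 x 2 block on the 384-well plate, which is what the pairings suggest; which of the four wells the sequencing facility actually uses is not stated here, so the function returns all four candidates. The function name and the zero-padded well format (for example "A01") are illustrative choices.

```python
ROWS_96 = "ABCDEFGH"
ROWS_384 = "ABCDEFGHIJKLMNOP"


def wells_384_for(well_96):
    """Return the four candidate 384-well positions for one 96-well position.

    Assumes row A maps to rows (A,B), column 1 maps to columns (1,2), and so on.
    """
    row = well_96[0].upper()
    col = int(well_96[1:])
    if row not in ROWS_96 or not 1 <= col <= 12:
        raise ValueError(f"not a 96-well position: {well_96!r}")
    r = ROWS_96.index(row)
    rows = (ROWS_384[2 * r], ROWS_384[2 * r + 1])
    cols = (2 * col - 1, 2 * col)
    return [f"{rr}{cc:02d}" for rr in rows for cc in cols]


print(wells_384_for("A1"))   # ['A01', 'A02', 'B01', 'B02']
print(wells_384_for("H12"))  # ['O23', 'O24', 'P23', 'P24']
```

If the returned file names contain a well label, this lookup narrows down which of your original 96-well samples a file may have come from.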
Bioinformatic Analysis of GAPDH Sequences
In this section, you will employ some of the features of iFinch to review your experimental
results and obtain a preliminary identification for your gene. iFinch contains both software and a
relational database. Both of these features are involved in data management and analysis. People
work with iFinch through the web: the web pages collect information, the information is analyzed by
software in iFinch, and it’s stored in a relational database. Other software programs pull
information out of that database, analyze it, and present it in tables and reports.

Cloning the GAPDH gene
• Extract genomic DNA from plants
• Amplify region of GAPDH gene using PCR
• Assess the results of PCR
• Purify the PCR product
• Ligate PCR product into a plasmid vector
• Transform bacteria with the plasmid
• Select and grow bacteria containing the plasmid
• Isolate plasmid from the bacteria
• Sequence DNA either through the Joint Genome Institute (JGI) or from a local DNA
sequencing service
• Perform bioinformatics analyses with the sequence data

The following types of analyses will be performed:
1. Assess the quality of the data and view sequence traces (Geospiza’s FinchTV™)
2. Identify the cloned GAPDH sequences
3. Assemble sequences into a contig (CAP3 program)
4. Use your sequences to BLAST a nucleotide database (NCBI GenBank)
5. Identify introns and exons (adding annotations)
6. Translate the predicted mRNA sequence (in six reading frames) and check with BLASTx
7. Translate the mRNA sequence to predict the sequence for your protein
8. Phylogenetic analysis (NCBI)
Section 1. Review the Quality of the Data and View Sequence Traces
Extracting Sequences and Quality Assessment
One of the first iFinch programs extracts the quality values and base sequence from the
chromatogram file. In DNA sequencing, fluorescently labeled molecules of DNA are separated by
size through capillary electrophoresis. As the DNA moves through the capillary, it passes in front of
a detector that measures the intensity of fluorescence. Software in the sequencing instrument
processes that signal and identifies each base. The chromatogram file that is produced by the
sequencing instrument also includes information about the presence of other signals, their relative
positions, and their intensity. When this information is presented in a graph, it’s called a trace. At
the top of the trace are the base calls. Each letter represents a base, with unidentified bases shown
as N’s. For each base, the height and shape of the peak corresponds to the signal intensity, and the
spacing shows the relative times when the signals were measured.
Many DNA sequencing instruments contain additional software that can evaluate the information
from the signal intensity, timing, and whether or not peaks overlap. This software provides
additional information about the quality of each base. A base is considered high quality when the
identity of the base is unambiguous. Other features used in quality measurement are the evenness of
the spacing between bases and the consistency of the signal strength. A high-quality region of
sequence has evenly spaced peaks that do not overlap and a signal intensity in the proper range for
the detection software.
In general, most base-calling programs define quality scores in the same way. The quality value
decreases as the probability that a base has been misidentified increases; thus a higher “quality
value” means that we can be more confident that the base is correct. If the “quality value” is low, we
are less likely to accept the base call. The equation for quality values is $Q = -10 \log_{10}(P_{error})$, where
$P_{error}$ is the probability of an error. If the chance of a mistake was 1 in 100, for example, $P_{error}$ would be
0.01, and the “quality value” would be 20. The image below, from Geospiza’s FinchTV program,
uses a blue line to mark the location of quality values equal to 20.
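The relationship between error probability and quality value is easy to check numerically. The short sketch below applies the equation in both directions; the function names are our own and are not part of iFinch or FinchTV.

```python
import math


def quality_value(p_error):
    """Quality value Q for a base with error probability p_error."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be greater than 0 and at most 1")
    return -10 * math.log10(p_error)


def error_probability(q):
    """Error probability that corresponds to quality value q."""
    return 10 ** (-q / 10)


# A 1 in 100 chance of error gives a quality value of about 20,
# and a quality value of 30 means about 1 error in 1000 bases.
print(quality_value(0.01))
print(error_probability(30))
```

Each increase of 10 in the quality value corresponds to a tenfold drop in the chance that the base call is wrong.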
Quality Trimming
The sequence of bases from each chromatogram is called a read. Once the read and quality
information have been extracted from the chromatogram or determined by the base caller, other
analytical programs can get to work. Since the data at the 5´ and 3´ ends of reads are often poor
quality, a standard step in DNA sequencing is quality trimming. In this process, a program
examines the quality of each base at the ends of each read. When a large enough fraction of the
bases have quality scores above a certain threshold, usually 20, the trimming program marks that
position at each end of the read and measures the length between trim points. Later, when we
download the DNA sequence, we can elect to have those portions trimmed or hidden.
In FinchTV, the trimmed regions are indicated by a grey shadow (shown below).
In the iFinch chromatogram report (below), the “quality value” for each base is shown in a graph.
Trimmed bases in the sequence have a strikethrough.
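To make the idea of trim points concrete, here is a minimal sketch of one way a trimming program could work. It is not the algorithm used by iFinch, which is not described here. It assumes a sliding window of 10 bases and requires at least 80% of the bases in the window to have a quality value of 20 or more; the window size, the fraction, and the function name are all illustrative assumptions.

```python
def trim_points(quals, threshold=20, window=10, min_fraction=0.8):
    """Return (start, end) of the region to keep, or None if nothing passes.

    Positions are 0-based and end is exclusive, so the kept length is end - start.
    """
    good = [q >= threshold for q in quals]
    need = min_fraction * window
    # Start positions of every window that contains enough good bases.
    passing = [i for i in range(len(quals) - window + 1)
               if sum(good[i:i + window]) >= need]
    if not passing:
        return None
    first, last = passing[0], passing[-1]
    # Tighten each trim point to the nearest good base inside its window.
    start = first + good[first:first + window].index(True)
    end = last + window - good[last:last + window][::-1].index(True)
    return start, end


# Toy read: poor quality at both ends, good quality in the middle.
quals = [5] * 5 + [30] * 20 + [8] * 5
print(trim_points(quals))  # (5, 25)
```

In this toy read the kept region is 20 bases long, which is the kind of length between trim points that a trimming program reports.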
Vector Screening and Masking
One other analytical program is important for our experiment. This program serves three
functions: vector identification, vector masking, and sequence screening. The sequence
identification steps involve comparing the read sequence to a database of DNA sequences, obtaining
a score, and calculating the percent that match. In one step, the read is compared to a database of
sequences from cloning vectors. In the other, the read is compared to a set of reference genomic
sequences from the seven known Arabidopsis GAPDH genes. These analyses allow us to determine
which parts of a read are similar to GAPDH and if any parts of the read came from the cloning
vector. They also allow us to quickly evaluate our results and identify which genes were cloned and
in what proportions.
Vector masking is an optional step that can take place when data are downloaded from iFinch. If
this option is selected, each base in the vector region is replaced by a special letter that will hide it
from other DNA analysis programs.
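Masking itself is a simple substitution. The sketch below assumes that the vector regions have already been located and are given as 0-based (start, end) pairs with the end excluded, and it uses "X" as the masking letter. Both the coordinate convention and the choice of "X" are assumptions made for illustration; the letter iFinch uses is not named here.

```python
def mask_regions(sequence, regions, mask_char="X"):
    """Replace every base inside the given regions with mask_char."""
    bases = list(sequence)
    for start, end in regions:
        # Clamp the region to the sequence so stray coordinates are harmless.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = mask_char
    return "".join(bases)


read = "GATTACAGATTACA"
print(mask_regions(read, [(0, 3), (11, 14)]))  # XXXTACAGATTXXX
```

The masked read keeps its original length, so base positions still line up with the chromatogram.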
On the previous page, a program in iFinch colored the vector sequences blue and sequences that
matched one of the GAPC genes red. When we view traces from iFinch in FinchTV, vector
sequences are shaded pink.
Relational databases and SQL
Because data are stored in a relational database, we have the ability to ask novel questions about
our data. When chromatogram files are loaded into iFinch, the data are stored in two ways. First,
they are stored in their original form. If you download chromatograms from iFinch they will be
identical in every way to the chromatogram files that were uploaded. Second, the information is
extracted and stored in an organized fashion in tables inside the database. A relational database
consists of several tables, based on a model of the data and the relationships between different types
of data (a read for example, is a DNA sequence that’s linked to a chromatogram). Each table
represents a different data type. We might have a table with the kinds of data that are found in
chromatograms and another table with the kinds of data that are found in reads.
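The two-table idea can be tried out with Python's built-in sqlite3 module. The sketch below is only an illustration: the table names, column names, and file names are hypothetical and do not reflect the actual iFinch schema. It links each read to its chromatogram and then asks a question with SQL: which chromatograms produced reads with at least a given number of Q20 bases?

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with two linked tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE chromatograms (
            id INTEGER PRIMARY KEY,
            file_name TEXT NOT NULL
        );
        CREATE TABLE reads (
            id INTEGER PRIMARY KEY,
            chromatogram_id INTEGER NOT NULL REFERENCES chromatograms(id),
            sequence TEXT NOT NULL,
            q20_bases INTEGER NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO chromatograms (id, file_name) VALUES (?, ?)",
        [(1, "example_A01.ab1"), (2, "example_B01.ab1")],
    )
    conn.executemany(
        "INSERT INTO reads (chromatogram_id, sequence, q20_bases) VALUES (?, ?, ?)",
        [(1, "GATTACA", 6), (2, "GATNNNA", 2)],
    )
    return conn


def files_with_good_reads(conn, min_q20):
    """Return (file_name, q20_bases) for reads with at least min_q20 good bases."""
    query = """
        SELECT c.file_name, r.q20_bases
        FROM reads AS r
        JOIN chromatograms AS c ON c.id = r.chromatogram_id
        WHERE r.q20_bases >= ?
        ORDER BY r.q20_bases DESC
    """
    return conn.execute(query, (min_q20,)).fetchall()


conn = build_demo_db()
print(files_with_good_reads(conn, 5))  # [('example_A01.ab1', 6)]
```

Because the link between a read and its chromatogram is stored once, in the chromatogram_id column, new questions only require a new query, not a new copy of the data.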
Section 2. Assembling the Sequence and Correcting Mistakes with FinchTV
Assembly Programs and Why We Assemble DNA Sequences
In nature, DNA molecules are found in a variety of sizes, many of which are quite large. Even
chromosome 21, the smallest human chromosome, is 47 million nucleotides in length. The smallest
Arabidopsis chromosome is 18.5 million nucleotides long. DNA sequencing technology however,
rarely produces sequences longer than 900 bases. Consequently, we can only find the sequence of a
longer piece of DNA by reconstructing it from smaller pieces. This process is called assembly or
sequence assembly. You will assemble the reads from your clones with a program called “CAP3.”
Many assembly programs, including “CAP3”, work by comparing all the sequences to each other,
calculating a score for each pair of sequences, and then merging the sequence pairs together,
working from the highest scoring pair to the lowest scoring pair, until all possible pairs have been
merged. The contiguous sequences that result from merging shorter sequences are called contigs. A
diagram of a contig is shown below. Some assembly programs like “CAP3” and “Phrap” can also
use quality information, when available, to help guide the assembly process. If there are positions
where the sequences disagree, these programs use the higher quality base as the choice for the
contig.
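The merge-the-best-pair-first strategy can be illustrated with a toy assembler. This sketch is a deliberate simplification and is not how CAP3 is implemented: it scores a pair of reads by the length of the exact overlap between the end of one read and the start of the other, it ignores quality values, and it does not try reverse complements or tolerate mismatches. The function names and the minimum overlap of 3 bases are our own choices.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise, best-scoring pair first, until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Discard sequences that are entirely contained in another sequence.
        contigs = [c for c in contigs
                   if not any(c != other and c in other for other in contigs)]
        if len(contigs) < 2:
            return contigs
        score, a, b = max(
            ((overlap(a, b, min_len), a, b) for a, b in permutations(contigs, 2)),
            key=lambda item: item[0],
        )
        if score == 0:
            return contigs  # nothing left to merge
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[score:])


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

Reads that share no overlap stay as separate contigs, which is the situation that finishing work tries to resolve in a real project.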
In genome sequencing, the next step that occurs is called finishing. Finishing is a process where
researchers examine the contigs to look for misassemblies or regions that require additional
coverage. That information may be used to identify errors, edit sequences and reassemble, or to
synthesize new primers and generate additional sequences to cover gaps and put contigs together.
One of the greatest problems in sequence assembly results from repetitive sequences. Repetitive
sequences, also known as repeats, occur at multiple positions in the genome and are closely related
to one another. These sequences complicate the assembly process because the assembly programs
find high scoring alignments between repetitive sequences and assemble them, even when they were
obtained from different parts of the genome. Finding errors that occurred from assembling the
wrong repeats together is still a challenge in genomic sequencing. Fortunately, Arabidopsis has few
repetitive sequences. Describing RepeatMasker is beyond the scope of this manual, but if you are
interested in screening your clones for repetitive sequences, you may use the RepeatMasker server at
the Institute for Systems Biology (www.repeatmasker.org).
Interpreting the Assembly Results
You will save two of the files from your assembly results, the contig sequence and the assembly
details. The “assembly details” file shows the contigs that were put together during the assembly,
lists the reads that were used in constructing each contig, and shows the positions of each read
relative to the contig sequence. Your “assembly results” will most likely show only one contig since
all of your clones should map to the same region of DNA.
When we view our “assembly details” file, we see this same information in greater detail. A
portion of an “assembly details” file is shown below with an alignment between three reads and the
consensus sequence. The consensus sequence (at the bottom) is the same as the contig sequence. If
quality values were supplied to the assembly program, the resulting sequence is a mosaic, put
together from the highest quality bases in each of the three reads. (In our case, the “CAP3”
assembly program that we’re using doesn’t have a place for entering quality values.) By convention,
the consensus sequence is shown in a 5´ to 3´ orientation. Above the consensus sequence are the
reads. If a read is in the same orientation as the consensus sequence, a + sign appears after the read
name. If the sequence of a read came from the other strand, “CAP3” determines the complementary
sequence and prints it in the reverse direction. We call this sequence the reverse complement.
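Computing a reverse complement takes two steps: swap each base for its partner, then reverse the order. A minimal sketch follows; it handles A, C, G, T and the ambiguity letter N, and the function name is our own.

```python
# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))   # GCAT
print(reverse_complement("AAGCT"))  # AGCTT
```

Applying the function twice returns the original sequence, which is a quick way to check that nothing was lost.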
Presenting the reverse complement of a sequence is helpful because it makes it easier for us to
spot positions where the reads disagree. These disagreements are called discrepancies. In the
image above, there’s a substitution discrepancy where two reads identify a base as an “A” and the
other read contains a “C.”
Another kind of discrepancy is called an indel. The word “indel” refers to either an insertion or
deletion. Indels are important because they can change the reading frame and make it harder for us
to predict the right protein sequence. The example below contains an indel. The read from
QCDP869377.b2_A01.ab contains an “A” and this base is missing (deleted) from
QCDP869381.b2_I01.ab.
Which read is correct? We can answer this question by viewing the trace in FinchTV. In this
part of the project, we will use FinchTV to review the assembly results and investigate discrepancies
between reads. If we find a mistake in our contig sequence, we will edit our reads and reassemble.
Section 3. BLAST Searches
Introduction to BLASTn
BLAST stands for Basic Local Alignment Search Tool (Altschul et al., 1990). The BLAST
family of programs is designed to find short (local) regions where pairs of sequences match.
Members of the BLAST family work in similar ways but for now, we’ll limit our discussion to
BLASTn. BLASTn is used to compare a nucleotide sequence to a database of nucleotide sequences.
To do this comparison, BLASTn breaks the query sequence (the sequence you’re searching with)
into ‘words’ of a defined length. Then BLASTn compares each word to a database of words,
derived from a set of nucleotide sequences. If all the letters in the words match perfectly, BLASTn
looks at each end of the word pair to see if the matching region might be extended and tries to make
the longest matching region that it can.
At the end of the search, BLASTn counts all the nucleotides in the matching region and awards
two points for every pair of bases that match. If one sequence has an insertion, a deletion, or a gap
(more than one base is missing) relative to the other, BLAST takes points away from the score. The
net result is that a BLASTn score is approximately equal to two times the length of the matching
region.
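The word-matching and extension steps can be sketched in a few lines. This is a teaching simplification, not the real BLASTn algorithm: it uses a word length of 4 (an arbitrary choice for a short example), extends a word hit only while the bases match exactly, allows no mismatches or gaps, and awards two points per matching base. All names in the code are our own.

```python
def seed_and_extend(query, subject, word_size=4, match_score=2):
    """Find the best exact-match region that grows out of a shared word."""
    # Index every word in the subject by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    best = None
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            q_start, s_start = i, j
            # Extend to the left while the bases keep matching.
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            q_end, s_end = i + word_size, j + word_size
            # Extend to the right while the bases keep matching.
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            length = q_end - q_start
            score = match_score * length
            if best is None or score > best["score"]:
                best = {"score": score, "query_start": q_start,
                        "subject_start": s_start, "length": length}
    return best


hit = seed_and_extend("TTGACCGTAAGG", "CCCGACCGTAAGC")
print(hit)  # {'score': 18, 'query_start': 2, 'subject_start': 3, 'length': 9}
```

The 9-base matching region scores 18, which shows why a score is roughly twice the length of the match when there are no penalties.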
The BLAST score is one of the primary statistics used to evaluate a match. The other main
statistic is the E value. The “E value” represents the number of sequence matches that you would
expect to find (where a different database sequence matches your query just as well as the subject
sequence matched your query) if you were to search a database of random sequences. When E
values are well below 1, they are approximately equal to the probability that two sequences will match to a certain
extent. This would mean that if we have an “E value” of 0.01, then there is about a 1% chance that we
would find an equally good match in a database of random sequences. Often E values are very low.
In fact, if we have a perfect match, the “E value” might be given as zero. It should be noted that
even though these values are shown as zero, they are never really zero. They’re just presented as
zero because the number of characters in the exponent might be too large to fit nicely in a table.
While low E values are good, high E values tell us that it might be possible to find an equally good
match by random chance.
Two additional factors have a strong influence on E values. These are the length of the sequence
and the size of the database. This is because it’s easier to find a perfect match to a shorter sequence.
It’s also easier to find a match in a larger database.
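These relationships can be explored numerically. BLAST statistics are based on the Karlin-Altschul equation, E = K · m · n · e^(−λS), where m is the query length, n is the database size, and S is the score; the probability of finding at least one match that good is P = 1 − e^(−E). In the sketch below, the values of K and λ are placeholders chosen only for illustration. They are not the parameters NCBI BLAST uses, so the E values printed here are not comparable to real BLAST output.

```python
import math


def expect_value(score, query_len, db_len, k=0.1, lam=1.0):
    """Karlin-Altschul E value; k and lam are illustrative placeholders."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance match: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# For small E, the probability is almost the same as E itself.
print(p_value(0.01))

# Doubling the database size doubles E for the same score.
small = expect_value(score=40, query_len=766, db_len=1_000_000)
large = expect_value(score=40, query_len=766, db_len=2_000_000)
print(large / small)
```

For an E value of 0.01 the probability is just under 1%, and the same score becomes less significant as the database grows.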
Understanding BLASTn Results
The results from a BLASTn search include many different kinds of information and statistics.
These bits of information include the size of the database, the length of each query sequence,
statistics that describe the number and percent of matching bases, a BLAST score, and the E value.
The sequences in this example (shown below) come from a GAPDH cloning experiment with a
plant from the genus Salvia (the common name is Sage). If you wish to use these reads and repeat
this test, these reads are in the “salvia1” folder in the GAPDH_test_data project, and they should be
available in your iFinch.
The first piece of information we see from BLASTn is a graph that shows a map of the query
sequence on top (in this case, a Salvia sequence), with colored boxes below to show where the
database sequences match and how well they match.
A key at the top of the graph shows how the different colors correspond to the BLASTn scores.
Below the key is a thick red bar depicting our query sequence. In this case, the query was 766 bases
long. Below the query are different colored lines that represent matching sequences from the
database. The top line has six colored blocks (of varying length and color) that are attached by
thinner lines. Each block corresponds to a part of the database sequence that matched the query. It
looks like the subject sequence has the best match to the query between nucleotides 160 and 440 (of
the query).
The next information from BLASTn is a table that summarizes the statistics. Each row contains a
matching sequence with the best matching sequence at the top of the table. Notice, in this table, the
information in the Description column tells us that the matching sequences come from specific
chromosomes, with the Accession number linked to the matching chromosome. We’ll find a more
precise location for our subject sequences and which GAP genes are matched in just a moment.
As we read across the table towards the right, the next column we come to is the Max score
column. Each of the colored blocks in the BLAST alignment graph above has been assigned a score
based on the goodness of the match. The “Max score” comes from the block of aligned sequence
that had the highest score. If we remember that the BLASTn score is about twice the number of
matching nucleotides, we can estimate that the maximum score for the top sequence either represents
113 matching bases or a longer region that contains gaps.
The next value is the Total score. The “Total score” is obtained by adding the scores from each
matching region. In our case, the “Total score” is not very helpful because it represents the total
from all the matching regions on a single chromosome. Since both chromosomes 1 and 3 contain
multiple copies of the GAPDH genes, this score isn’t informative.
In the next column, the Query coverage corresponds to the fraction of the entire query sequence
that’s matched by parts of the subject. Next, we have the E value. In the top row, the “E value” is
$6 \times 10^{-58}$ (the “e” in the table stands for “exponent”). This means that there is roughly a 6 in $10^{58}$ chance of
finding this match in a database of random sequences. In other words, a match like this is not likely
to occur by random chance, which is good: the lower the “E value” the better.
Moving farther right, the Max ident column shows the block of sequence that has the highest
percentage of matching bases. In this case, the “maximum identity” of any matching block is 100%;
however, when we scroll down and examine the matching regions in more detail, we find that the
only region where 100% of the bases match is only 18 nucleotides long. In this experiment, this
statistic is not
terribly useful.
The last column contains links to other databases that are identified in a key above the table. In
the case of our results, the last column contains links to GEO. “GEO” is the Gene Expression
Omnibus database. This database stores information from microarray experiments.
To view our alignments, we can either scroll down the page, click the link in the “Max score”
column, or click a subject sequence in the alignment graph. The sequence alignments are organized
by subject sequence, with all the matches from one subject sequence grouped together. The sets of
alignments are presented in the order of the maximum score, with the first set containing the longest
and best alignment. This can be seen in the alignments below.
At the top of this set, we see the name of the subject sequence. It appears that the best match to
this query sequence is GAPC. We can confirm this by looking at the sequence location in the
Arabidopsis genome.
The numbers that are shown at the beginning and ending of each alignment correspond to specific
nucleotide positions in the sequences. For the subject sequence, these values are locations in a
specific chromosome. We can see from the map below that the sequence coordinates from the first
alignment, 1,082,841 and 1,082,587, place this alignment within the GAPC gene on chromosome
3. We can conclude that the identity of this query sequence is GAPC.
In other cases, the results can be more ambiguous. In the image
below, we see BLASTn results from one of the other Salvia query sequences. The best matching
regions of the subject sequences (GAPC and GAPC-2) match this query almost equally well. In this
case, we have to decide which sequence is the best match by calculating the total score for all the
alignments between the subject sequences (within a specific gene) and our query.
To Analyze Your BLASTn Results
In this section, you will examine the two genes that your query sequence matches best and
calculate a “Total score” for the match to each gene (if there’s a match at all).
For each query sequence, examine the graph and the coordinates of the matching subject
sequences (shown in the alignment) to determine which two genes match your query sequence the
best. To increase the certainty of your identification, you will need to determine the “Total score”
for the match between the gene and your query sequence. To make this easier, there are tables
provided in Exercise 3.2 that you can use for analyzing your results. Complete the information in
two tables (one for each of the two best matching genes) for each of your query sequences.
The tables show the four GAPC genes (GAPC, GAPC-2, GAPCP-1, and GAPCP-2), their
chromosomal locations, and the coordinates. For each query, use the two tables that correspond to
the two best matches.
Record the beginning and ending positions for each alignment, and the score for that alignment.
When you are through entering the information for each region that aligns within that gene, calculate
a total score for that gene by adding the scores for each region.
Example alignment data from a BLAST search with a Salvia query sequence (QCDP869377) and
an example of one completed table are shown below.
Example alignments:
For Query= QCDP869377.b2_A01.ab1 Chromat_id=2140 Length=766 Type=Folder Name=KSalvia1
Id=47
Length=766
Alignments to GAPC:
Alignments to GAPC-2
Example tables:
Query= QCDP869377.b2_A01.ab1

Gene      Chromosome   Begin       End         Query begin   Query end   Score for query sequence match
GAPC      3            1,082,841   1,082,587   183           437         226
                       1,082,344   1,082,244   666           766         87.8
                       1,083,030   1,082,920   6             116         75.2
Total score                                                              389

Gene      Chromosome   Begin       End         Query begin   Query end   Score for query sequence match
GAPC-2    1            4,609,088   4,609,193   334           439         111
                       4,608,855   4,609,004   187           336         86
                       4,609,431   4,609,531   666           766         75.2
Total score                                                              272.2
i. Use the total score that’s calculated from the table to identify the gene that your sequence
matches best. Repeat this for each of your query sequences.
ii. The result from the best match gives you the identity of your gene.
Conclusion from this example:
Our analysis shows that the identity of the sequence from chromatogram
QCDP869377.b2_A01.ab1 is most likely to be GAPC. This identification is supported by the total
BLASTn score of 389, which is higher than the next best match (GAPC-2), with a score of 272.
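The bookkeeping in the example tables can also be done with a short script. The sketch below uses the alignment scores from the Salvia example; it assumes you have already decided which gene each alignment falls in by checking its coordinates, and the function names are our own.

```python
# (gene, alignment score) pairs taken from the example tables
alignments = [
    ("GAPC", 226), ("GAPC", 87.8), ("GAPC", 75.2),
    ("GAPC-2", 111), ("GAPC-2", 86), ("GAPC-2", 75.2),
]


def total_scores(alignments):
    """Add up the alignment scores for each gene."""
    totals = {}
    for gene, score in alignments:
        totals[gene] = totals.get(gene, 0) + score
    return totals


def best_gene(alignments):
    """Return the gene with the highest total score."""
    totals = total_scores(alignments)
    return max(totals, key=totals.get)


for gene, total in total_scores(alignments).items():
    print(gene, round(total, 1))   # GAPC 389.0, then GAPC-2 272.2
print(best_gene(alignments))       # GAPC
```

The totals agree with the hand-calculated values in the tables, and the higher total identifies GAPC as the best match.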
Section 4. Adding Annotations: Finding the Intron-Exon Boundaries, Putting the mRNA
Together and Checking Prediction with BLASTx
Annotating
Most eukaryotic genes contain internal sequences that do not code for protein. During gene
expression, RNA splicing removes these sequences, known as introns, and fuses the sequences that
contain coding information (exons) together. This process is shown below.
Our goal in this stage of the project is to construct a gene model that shows where the exons are
likely to be located within our contig and to predict the likely amino acid sequence for the encoded
protein. The process of identifying the coding sequences and adding information to our work is
called annotating the sequence. The extra bits of information are called annotations.
Predicting splice sites in genomic DNA is an active area of research. We can build gene models
by aligning mRNA sequences to genomic DNA, but we do not understand enough about splicing
signals yet to predict splice sites accurately by computation alone. Ideally, we would identify our
splice sites by aligning our genomic contig sequence to a complete mRNA sequence from the same
gene, from the same species of plant. Since we are looking at new genes, however, those sequences
are probably not available. Luckily, we have an alternative. We can use the reference mRNA
sequences from other plants. These are mRNA sequences that have been reviewed by NCBI staff
and characterized.
We will begin this stage by aligning our contig sequence to the reference mRNAs with BLASTn,
then, we will use Microsoft Word® or another text editor to construct our gene model by putting the
sequence of our predicted mRNA together. We will test the gene model with BLASTn and make
corrections as needed. These steps will be repeated multiple times until we’re satisfied that the
model is correct. Finally, we will test the predicted protein sequence to determine if the splice sites
were identified correctly.
How to Interpret the Results and Predict the Exon Positions
The BLASTn results will show the name of your contig and its length in nucleotides towards the
top of the page. Below will be a graph showing where portions of the reference mRNA sequences
align to your contig. The ends of each alignment correspond approximately to the ends of each
exon.
The graph shown here was obtained by using the “salvia1” contig after reviewing the assembly
and editing the contig sequence to correct any errors.
From the graph and the table, the second mRNA, from the Arabidopsis thaliana GAPC gene, has
the longest and most extensive match. The first gene turns out to be an unidentified genomic
sequence from rice. We can predict from this graph that the exons on our contig are approximately
located between nucleotides 0 to 100, 225 to 400, 500 to 550, 700 to 800, and 1260 to 1300.
Now we will look for more precise exon/intron boundaries and mark the positions in the contig.
This will involve reformatting your BLAST results, preparing your contig sequence, and working
with the sequence data.
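Once the exon boundaries are marked, putting the predicted mRNA together is a matter of joining the exon segments in order. This can be done by hand in a text editor, or with a short script like the sketch below. The sketch assumes exon positions are given as 1-based, inclusive (start, end) pairs in 5´ to 3´ order; the function name and the toy contig are illustrative and are not taken from the Salvia data.

```python
def build_mrna(contig, exons):
    """Join exon segments of a contig into a predicted mRNA sequence.

    exons: list of (start, end) pairs, 1-based and inclusive, in 5' to 3' order.
    """
    parts = []
    previous_end = 0
    for start, end in exons:
        # Exons must be in order, must not overlap, and must lie inside the contig.
        if not previous_end < start <= end <= len(contig):
            raise ValueError(f"bad exon coordinates: {(start, end)}")
        parts.append(contig[start - 1:end])
        previous_end = end
    return "".join(parts)


# Toy contig: two exons (AAA and GGG) separated by an intron (CCC).
contig = "AAACCCGGGTTT"
print(build_mrna(contig, [(1, 3), (7, 9)]))  # AAAGGG
```

Keeping the exon coordinates in a list makes it easy to adjust a boundary by a base or two and rebuild the model while you test it.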
Section 5. Studying the Evolution of GAPDH
Phylogenetics
Phylogenetics is the study of the relationship between organisms. In the past, both anatomical and
fossil records provided evidence on how closely or distantly related organisms are.
However, with the advent of bioinformatics, molecular evidence (DNA and protein sequences) has
been used to compare homologous sequences and to build phylogenies based on these sequence
comparisons. The evolutionary relationship between taxonomic groups can be represented
graphically using an evolutionary tree, also called a phylogenetic tree. Similar to a family tree that
illustrates descendants, family relationships, and successive generations, phylogenetic trees illustrate
the evolutionary relationships between taxonomic groups. This tree is based on the concept that
there are common ancestors. Evolutionary time is represented by the branch length.
Exercise 1.1 Examine Traces with FinchTV and Evaluating Data Quality.
1. Log into the practice iFinch account
(http://www.geospiza.com/education/products/biorad.html) with the publicly available user
name and password.
User name: “BR_guest”
Password: “guest”
Computer ID: ____________
2. Find the folder with your class data by clicking the linked value or by clicking Folder from the
Chromats menu. Your Folder Name ________________________________
3. A page will appear with a list of folders. Click your folder label to see the data from your
plant.
4. A page will appear that presents some of the data from the folder. Each row contains data
from a single chromatogram. Each column shows a different type of data.
iFinch tables also include tools for working with the data. You may:
• sort the data by clicking any of the column headings that appear in blue
• filter or view data with certain characteristics by using the finder at the top of the table
• view the data as an Excel® spreadsheet by clicking the Excel® icon
• view a trace in FinchTV by clicking the FinchTV icon
5. Find the column labeled “Q20”. These data correspond to the number of bases in each read
that have a “quality value” of 20 or greater.
6. Click the column heading to sort the data so that the lower-quality data (low numbers of Q20
bases) appear at the top of the table.
7. Click one of the chromatogram labels to view the Chromatogram Read report. Notice the
graph of the quality values. The dotted line marks where the “quality value” is equal to 20.
Do you see many quality values above 20? ______
How many quality values are above 20? ______
How many quality values are below 20? ______
8. Open the first file by double-clicking on one of the FinchTV icons (located to the left of the
chromatogram labels).
What can you say about the quality of the bases at the beginning and end of the read? (You
may wish to click the wrap button in FinchTV to scroll through the trace view more easily.)
Give the base numbers for the low-quality regions for all 4 files in your folder:
1) ________________________
2) ________________________
3) ________________________
4) ________________________
9. You can also see the quality values for different bases. Click a base that you think is low
quality and look at the bottom left corner of the FinchTV window. You can see the quality
value for any base and the position where it appears within the sequence. Repeat this action
with a high-quality base.
Did the quality values match your expectations? Explain your answer.
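The Q20 count used in this exercise can be sketched in a few lines of Python. This sketch assumes the quality values follow the standard Phred scale, in which a quality value Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means roughly a 1-in-100 chance that the base call is wrong. The function names and the example quality values are made up for illustration; this is not the code iFinch itself uses.

```python
def error_probability(q):
    """Estimated probability that a base call is wrong, for a Phred-scaled quality q."""
    return 10 ** (-q / 10)


def count_q20(qualities, threshold=20):
    """Count bases whose quality value is at or above the threshold."""
    return sum(1 for q in qualities if q >= threshold)


# Hypothetical quality values for a short read: poor at both ends, good in the middle.
example_read = [8, 12, 19, 20, 35, 42, 50, 47, 33, 20, 15, 9]

q20_bases = count_q20(example_read)
print(q20_bases, "of", len(example_read), "bases have Q >= 20")
print("Error probability at Q20:", error_probability(20))
```

The low values at both ends of the made-up read mirror what you typically see at the beginning and end of a real trace.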
Exercise 1.2 Examine Folder Reports and Statistics
Now that you know what Q20 values represent and understand data quality for individual bases and
reads, let’s look at your data.
1. Select the “Folders” link in the Chromats menu.
2. Click the label of your folder.
3. Click the “Folder report” link at the top of the table.
The Folder report presents a visual summary of the data in the folder. The Q > 20 histogram
presents the number of reads that contain varying numbers of good quality bases. These types of
graphs are helpful when you wish to compare sequencing results from different data sets.
Exercise 2.1 Assemble Your Sequences
In this portion of the project, you will assemble the FASTA files that were downloaded from iFinch
and stored in your folder. To do the assembly, you will use the “CAP3” assembly server at the
University of Lyon in France (http://pbil.univ-lyon1.fr/cap3.php). After the assembly, you will need
to review the assembly details file and determine if there are any discrepancies between the reads. If
you see discrepancies, you will need to review the trace files in FinchTV, edit the traces if necessary,
and reassemble the edited reads.
1. Get your file with the FASTA-formatted read sequences from your iFinch folder. To do this,
click the “Folders” link in the Chromats menu.
2. Locate and click the link to your iFinch folder.
3. Notice the two tabs at the top of the table. Within each folder, files are organized by whether
or not they’re chromatograms. If you click the Chromatogram tab, you will see a list of your
chromatograms. If the Other tab is selected, this page shows a list of all the other files.
Let’s start assembling your contig:
4. Open your folder. Click “Download Folder Data”. On the Download page, click the
“Export Sequences” button. If you have problems, save the file to the Desktop first and then
open it using Word®. You might have to browse through different programs to find
Word®.
5. Open the file in a text-editing program (Windows). You do not need to save this file.
6. Highlight everything on this page and copy it; you will paste it into the assembly form.
7. Go to the iFinch home page and click the “Sequence Assembly” link to access the University
of Lyon CAP3 web service (http://pbil.univ-lyon1.fr/cap3.php).
8. Paste the copied text into the text box in the assembly web form.
9. Click the “Submit” button to start the assembly.
10. When the assembly is complete, a page will appear with links to your results. These links
are:
a. Contigs
b. Single sequences
c. Assembly details
d. Your sequence file
We’re going to look at each of these in turn. Use the back button on the browser to return
to the assembly results page after viewing each page.
11. Click “Your Sequence File” first to make certain that you pasted the correct information into
the form.
12. Next, click “Single Sequences.” Sequences appear here if they could not be used in the
assembly.
13. Click “Contigs.” You will most likely see one, or possibly two, contig sequences. Save this
page as a text file (.txt). The name of this file should include last-name, plant-name, and the
phrase “contig-fasta.” Write that name here: _________________________________
14. Ideally, all your read sequences should be assembled into a single contig. If you end up with
multiple contigs, it’s possible that the sequences you assembled didn’t really belong together.
Other explanations could be poor-quality data or mistakes by the assembly program. If you
do have multiple contigs, pick one for further analysis.
15. Click “Assembly Details.” Save this page as a .txt file, with your name, plant-name and the
word “assembly.” If you saved this as a Word® file (.doc) be sure to change the left and right
margins to 0.4 inches each. Write that name here: __________________________________
Exercise 2.2 Store Your Multi-Sequence FASTA Files in iFinch
Before going forward, it will be helpful to store your downloaded contig FASTA file in iFinch. This
way, it will be ready later for sequence assembly.
1. To upload your assembly results, return to iFinch and click the “Upload” link in the
“Chromats” menu.
2. The Data “Upload” page will appear.
3. Choose “Generic” for the file type.
4. Click the “Browse” button to find your file on your computer.
5. Click the spyglass to locate your folder.
6. Click the “Submit” button to upload your file.
7. Confirm that the uploading options you chose were correct. The message should say
“Generic file” and contain your file name (your names, plant name, and the word
“assembly”).
* Note: If the wrong file type was chosen or the wrong folder, you will not be able to find
your data in your folder later and you will need to upload the files again.
Exercise 2.3 Review the Assembly with FinchTV
1. Find your contig and assembly details file in your iFinch folder.
2. Click the filenames to download the files to your computer, if they aren’t already there.
3. Open the “assembly details” file with any text-editing program or Microsoft Word®.
4. Print this file so that you can mark discrepant positions with a highlighter for review.
5. Open up your iFinch folder.
6. Begin editing by finding and reviewing indels. To find indels in your assembly details,
search your assembly file for “_” characters. Note positions where indels are located.
7. Review the sequences by eye to identify other positions where the bases disagree. Note the
positions of these discrepancies.
8. To review a read, find the sequence in your folder and click the “FinchTV” icon to open the
read in FinchTV.
9. Examine the assembly detail file to determine how the read is oriented relative to the
consensus sequence.
• If the assembly details show a “+” after the name of your read, then your read
sequence will be the same sequence that is shown in FinchTV.
• If there’s a “-” sign after the name of your read, you will need to obtain the reverse
complement of your sequence. In that case, click the FinchTV reverse
complement button to get the reverse complement of the read.
10. Now, copy a portion of the sequence that appears on the 5′ side of the discrepancy as shown
in the example below.
11. Next, paste your copied sequence in the FinchTV “Find Sequence” window.
12. Press the return (or enter) key to find the sequence and highlight it as shown below.
We can now examine the problem base. Looking at the trace, it’s pretty clear that one of the
reads contained an error and the indel should have been an “A.” This should be changed. If you
can’t decide which read is correct just let them remain the way they are. Don’t try to force these
sequences to match. We have other clones from this species to compare them to.
13. If other reads align in a questionable region, check those traces too, and confirm your
results. To do this, find the other read that aligns at that position and shows the deletion, and
check the base in the same way. You can have several reads open on the computer at the same
time to make this job easier.
When we look at the other read, we can see that it does have an “a” at the discrepant
position but it was missed by the base caller. So we can be confident that the “A” is
correct.
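Steps 9 through 12 of this exercise (orienting a read and locating the bases next to a discrepancy) can also be sketched in Python. FinchTV does this for you with its reverse complement button and “Find Sequence” window; the sketch below only illustrates the idea. The read and the flanking sequence are invented for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_flank(read, flank):
    """Return the 0-based position of the flank in the read, or -1 if it is absent."""
    return read.upper().find(flank.upper())


# Hypothetical read that was assembled in the "-" orientation.
read = "TTGACCGTAGGCTAACG"
oriented = reverse_complement(read)
print(oriented)

# Hypothetical bases copied from the 5' side of a discrepancy in the assembly details.
position = find_flank(oriented, "GCCTAC")
print(position)
```

A result of -1 from `find_flank` would suggest that the read is still in the wrong orientation or that the copied bases contain a gap character.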
Exercise 2.4 Edit Reads with FinchTV
In the previous example, our consensus sequence was correct even though one of the reads contained
a mistake. What should we do if the consensus sequence is mistaken? If the consensus sequence is
wrong, we need to correct the mistake and reassemble the reads. We can correct the mistake by
editing the read sequence in FinchTV.
1. Find the read that you need to edit.
2. Open the trace in FinchTV and locate the questionable region of sequence.
3. To change a base, click the base to highlight it and type a new base.
4. To insert a base, click the base on the 3′ side of the insertion point. Select “Insert before
base” from the edit menu and enter the base.
5. To delete a base, click the base and press the delete key on your keyboard.
Save your edits back to the iFinch database. This process creates a record of any changes that are
made to the read sequence. Any edits can be reviewed by selecting the “View Revision
History” link found on the “Chromatogram Read” page.
6. Continue reviewing and editing all the discrepancies in your assembly until you are satisfied
that each read shows the correct sequence. Exit the file when done; this will prompt the
computer to save the changes for you.
Exercise 2.5 Reassemble Edited Reads
If you found cases where the correct sequence was different from the consensus, and you edited the
read file in FinchTV, you can obtain a corrected contig sequence by carrying out another assembly.
Note: your new assembly results will show edited bases as lowercase letters. Follow the same
procedure as in Exercise 2.1 (University of Lyon’s website), and save your files back to your folder
in iFinch. Give this file a different name that includes “contig-edited” along with your name and plant name.
Review Questions
1. What is a contig?
2. Why is it important to have multiple reads of the area of a gene when assembling a contig?
3. What does it mean if a read has a “-” sign after the name?
Exercise 3.1 Download FASTA Formatted Sequences and Use BLASTn to Verify the Identity
As previously mentioned, distantly-related sequences might not match the Arabidopsis sequences
that we’ve put into iFinch for screening. To ensure that you’re finding everything you can,
download the FASTA sequences from your high-quality clones and compare them to the GAPDH
sequences in GenBank. The same thing can be done with the contig assembly. A FASTA-formatted
sequence begins like this:
>sequence_name
GAGAGAGATAGATAGATAGATAGAGAGCCCCGCGAG
The sequence begins with a “>” symbol, then the sequence name without spaces, a new line, and the
DNA sequence in text without paragraph formatting. If your contig does not begin with a “>”
symbol, arrange it so it does.
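If you would like to check a file's formatting before uploading it, the FASTA rules described above can be tested with a short Python sketch. The parser below is a minimal illustration written for this exercise, not part of iFinch or BLAST; it accepts one or more sequences and reports an error if sequence text appears before any “>” title line.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is None:
            raise ValueError("sequence data found before a '>' title line")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>sequence_name
GAGAGAGATAGATAGATAGA
TAGAGAGCCCCGCGAG
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

The sequence lines are joined, so a sequence wrapped over several lines is read as one continuous string.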
Our objective here is to verify the identity of our contig and to examine its similarity to published
sequences.
1. Click the “Geospiza Finch” in the top left corner to return to the iFinch home page.
2. Click the link to “NCBI BLAST”. In the BLAST page, look under the BLAST heading and
choose “nucleotide BLAST”.
3. Click the “Browse” button on the form to locate your contig sequence file and upload it (see
example below).
4. Next, open up the database pull-down list and select “Reference genomic sequences
(refseq_genomic)”.
We need to use the genomic database because these subject sequences will include genomic
coordinates, where others will not. We need those coordinates in order to distinguish between
the different members of the Arabidopsis GAPDH gene family and determine which gene
is most similar to ours. If we use other databases, we will not be able to distinguish between
matches to the different members of the GAPDH gene family.
5. Click the button to choose “BLASTn” as the program. BLASTn is more sensitive than the
other programs and allows you to find “somewhat similar sequences”.
Note: BLASTn differs from megablast and discontiguous megablast by using a smaller
word size. As a consequence, BLASTn is more sensitive and can find more distantly related
sequences. There is a price for this sensitivity: BLASTn is also more likely to find
matches that occur by random chance.
6. Type “plants” in the Organism text box. As you type, different selections from the
Taxonomy database will appear. Choose “plants” from the list.
7. Click the blue “BLAST” button. After a few moments, the results will appear. Be sure to
read through Section 4.2 and 4.3 below before interpreting your results.
Note: If you sign up for an NCBI account (top right corner) and log in, NCBI will store
your BLAST results for 36 hours. You can view your results by logging in and
selecting the “Recent Results” tab.
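The effect of word size mentioned in step 5 can be illustrated with a small Python sketch. BLAST starts an alignment only where the query and subject share an exact word of the chosen size, so two sequences that differ every few bases may share short words but no long ones. The sequences below are invented, and the functions are a simplified illustration of word matching, not the BLAST algorithm itself. The word sizes 11 and 7 are used because 11 is, to our knowledge, the usual BLASTn default and 7 is the value used later in Exercise 4.1.

```python
def words(seq, size):
    """Return the set of all overlapping words of the given size in seq."""
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}


def shared_words(query, subject, size):
    """Count distinct words of the given size that occur in both sequences."""
    return len(words(query, size) & words(subject, size))


# Hypothetical sequences that differ at every ninth position.
query = "ATGGCTAAGATCAAGGTTGGAATCAACGG"
subject = "ATGGCTAACATCAAGGTCGGAATCAATGG"

for size in (11, 7):
    print("word size", size, "->", shared_words(query, subject, size), "shared words")
```

With these made-up sequences no word of size 11 is shared, so a search with that word size would have nowhere to begin, while the smaller word size still finds exact matches.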
Exercise 3.2 Calculating the Best Match for Your Contig.
Run your contig in BLASTn as done previously. Fill in whichever table(s) below are appropriate.
Don’t worry if the Arabidopsis sequence is in the “-” orientation (where the beginning value is
higher than the ending value).
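Gene: GAPC          Arabidopsis chromosome: 3
Begin (> 1,080,957)     End (< 1,083,357)     Query begin     Query end     Score
___________________     _________________     ___________     _________     _____
___________________     _________________     ___________     _________     _____
Total score: _____

Gene: GAPC-2        Arabidopsis chromosome: 1
Begin (> 4,608,193)     End (< 4,610,644)     Query begin     Query end     Score
___________________     _________________     ___________     _________     _____
___________________     _________________     ___________     _________     _____
Total score: _____

Gene: GAPCP-1       Arabidopsis chromosome: 1
Begin (> 29,920,795)    End (< 29,924,127)    Query begin     Query end     Score
___________________     _________________     ___________     _________     _____
___________________     _________________     ___________     _________     _____
Total score: _____

Gene: GAPCP-2       Arabidopsis chromosome: 1
Begin (> 5,574,304)     End (< 5,577,616)     Query begin     Query end     Score
___________________     _________________     ___________     _________     _____
___________________     _________________     ___________     _________     _____
Total score: _____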
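The bookkeeping in this exercise (deciding which gene each BLAST hit falls in and adding up the scores) can be sketched in Python. The gene intervals are the ones given in the tables for this exercise; the hits are made up for illustration, and the function name is our own. Hits in the “-” orientation are handled by sorting their begin and end coordinates first.

```python
# Genomic intervals from the tables in Exercise 3.2: (chromosome, begin, end).
GENES = {
    "GAPC": (3, 1_080_957, 1_083_357),
    "GAPC-2": (1, 4_608_193, 4_610_644),
    "GAPCP-1": (1, 29_920_795, 29_924_127),
    "GAPCP-2": (1, 5_574_304, 5_577_616),
}


def total_scores(hits):
    """Sum alignment scores per gene for hits that fall inside a gene interval.

    Each hit is (chromosome, subject_begin, subject_end, score).
    """
    totals = {gene: 0 for gene in GENES}
    for chrom, begin, end, score in hits:
        low, high = sorted((begin, end))  # "-" orientation hits have begin > end
        for gene, (g_chrom, g_begin, g_end) in GENES.items():
            if chrom == g_chrom and low > g_begin and high < g_end:
                totals[gene] += score
    return totals


# Made-up hits for illustration only; real values come from your BLAST report.
example_hits = [
    (3, 1_081_200, 1_081_350, 180),
    (3, 1_082_900, 1_082_700, 95),  # "-" orientation
    (1, 4_609_000, 4_609_120, 60),
]

totals = total_scores(example_hits)
best = max(totals, key=totals.get)
print(totals)
print("Best match:", best)
```

The gene with the highest total score is the Arabidopsis family member most similar to your contig.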
Review Questions
1. Does an “E value” of zero mean that your sequence matched the subject sequence well or
poorly? Explain your answer.
2. What would it mean if you found a subject sequence with an “E value” of 3?
3. Why did we search the reference genomic database?
Exercise 4.1 Use BLASTn to Align the Contig to Reference mRNA Sequences
1. Locate your contig sequence that should now be on your desktop.
2. Go to the iFinch home page and click the “NCBI BLAST” link.
3. Select “nucleotide BLAST” from the BLAST home page.
4. Click the “Browse” button to locate your contig sequence on your desktop and upload it for
BLASTing.
5. Select “Reference mRNA sequence (refseq_rna)” for the database.
6. Type “plants” in the Organism box.
7. Select “BLASTn” (“Somewhat similar sequences”) as the program.
8. Click the “Algorithm parameters” link at the bottom of the BLAST page.
9. Change “Max target sequences” from “100” to “10”. This change will make the results easier
to interpret because fewer sequences will be shown.
10. Change the “Word size” to “7”. This will increase the sensitivity of the BLASTn search and
allow us to detect more distantly related sequences and short exons.
11. Click “BLAST” and your results should appear after a few moments.
Exercise 4.2 Reformat Your BLASTn Results
1. First, you will need to change the formatting of your BLASTn results to make the exon
boundaries easier to spot. Select “Reformat these Results” from the top of the page.
2. The “Format Request” page will appear.
3. There are a number of settings that you will change on this page.
a. Change the “Show” setting to “Plain Text.”
b. Change the “Alignment View” to “Query-anchored with dots for identities.”
c. Change all of the values in the “Limit results” row to “10”. This change will make the
results easier to read.
4. Click the “View report” button to see your reformatted BLASTn results.
5. Save your results as a .txt file. Since you may be working with this file for some time, you
may find it helpful to upload these results and store them in your iFinch folder for
convenience. Print one copy of these results to hand in. Keep this file open for now.
Exercise 4.3 Find the Exon-Intron Boundaries
1. Open the contig.txt file that is on your desktop and copy-and-paste all of the text into a blank
Microsoft Word® document. Change your left and right page margins to 0.3 inches.
2. Insert a blank space between the contig title and the DNA sequence, but make sure your
sequence is still in a FASTA format (i.e. begins with “>”). Keep this file open.
3. A query-anchored alignment shows your query sequence (in this case, your contig) at the top.
The alignments to all other sequences are shown below with identical bases shown as dots.
For this step, we will find and work with each exon in turn. For each exon, we will locate the
beginning of the exon in the contig sequence and mark it (in the Word® document) by
making the text a different color. Then we will find the 3´ end of the exon sequence and
mark where it’s located in the contig. Last, we will color all the text in between the 5´ and 3´
ends of the exon. When we are through, each exon will be shown in a different color than
the rest of the text, making it easier to pick out those regions of sequence. We are using an
example below from an experiment to clone GAPC from Salvia. Your results will be
different.
4. You should have two files open now: the BLAST alignment (.txt) and the contig you saved
in Word®. An example of the BLAST alignment is shown directly below. You can see that
the first exon begins at position 6 in our contig and reads AGCC.
5. Find the first AGCC in our contig sequence (this is in the Word® file), beginning at position
6, and color it red to mark the beginning of the exon.
6. Now use the BLAST alignment to find the 3′ end of the exon. Scan the BLAST results until
the aligned region appears to end. In this example, there are five possible locations where the
first exon might end. Since the second sequence is from Arabidopsis, and Arabidopsis is
probably the best characterized plant genome, we’ll guess for now, that this is the end point
for our exon. Our later results will help us determine if this is correct or not.
7. Now you will need to find the location of the 3′ end in your contig. To do this, it’s helpful to
use your mouse to select and copy a section of the query sequence (about 20-25 bases) from
the BLAST alignment that precedes the 3′ end (shown below).
8. Use that copied sequence and the “Find” command to search your Word® document and
locate this sequence in your contig. You can use Control-V to paste this sequence into the
“Find” box.
9. When you find the sequence, change the color of the text to mark it.
10. Now select all the bases between the beginning and end of the exon in the contig.doc file,
and change the color for those bases too.
11. Repeat this process for the remaining exons, marking the 5′ end, then the 3′ end, and then
coloring the bases in between. Be sure to mark the different exons with different colors of
text. Save this altered file.
12. When you are through, your contig sequence will look something like this (except in color):
13. It may be helpful to store your sequences in your iFinch folder between class sessions.
Make certain that you keep your sequence in a FASTA format.
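The search you carried out with the Word® “Find” command in this exercise can be expressed as a short Python sketch: find where the exon begins, then find the copied bases that precede its 3′ end, and report the positions. The contig below is invented (lowercase letters mark the made-up intron bases only to make the example easier to read), and the function name and arguments are our own.

```python
def locate_exon(contig, start_motif, end_flank, search_from=0):
    """Return (start, end) 0-based, end-exclusive coordinates of an exon.

    start_motif: the first bases of the exon, read from the BLAST alignment.
    end_flank:   the last bases of the exon, copied from the query line.
    """
    start = contig.find(start_motif, search_from)
    if start == -1:
        raise ValueError("start motif not found")
    flank_at = contig.find(end_flank, start)
    if flank_at == -1:
        raise ValueError("end flank not found")
    return start, flank_at + len(end_flank)


# Hypothetical contig; the first exon begins at position 6 and reads AGCC.
contig = "ttacgAGCCTTGGAAGGTCTCgtaagtttcagGCTGACTACgtatg"
start, end = locate_exon(contig, "AGCC", "GGTCTC")

# Add 1 to the start to report 1-based positions, as BLAST does.
print("Exon spans positions", start + 1, "to", end)
print(contig[start:end])
```

In a real contig all bases are in the same case, so use a copied stretch of 20-25 bases, as in step 7, to avoid matching the wrong location.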
Exercise 4.4 Check Your Proposed mRNA with BLASTn
Now, it’s time to check your work by doing a BLASTn search with your proposed mRNA sequence.
1. In the contig.doc file create the proposed mRNA sequence by copying-and-pasting the
colored regions only; putting them together in the same order to create a model for your
mRNA. Do this on the same page as the original contig sequence. When you are done, this
second sequence should not have any black letters. Make sure that your sequence is in a
FASTA format. It should look something like this:
2. Go to NCBI and select “nucleotide BLAST”.
3. Copy your mRNA sequence from your Word® document and paste it into the BLASTn
search box.
4. Choose the “Reference mRNA sequences” as your database.
5. Type “plant” for the Organism, and select the “plants” option.
6. Choose BLASTn as the program.
7. Click “Algorithm parameters” and choose “7” for the Word size. Click “BLAST”.
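Step 1 of this exercise (joining the colored regions, in order, into a proposed mRNA) can also be sketched in Python once you know the exon positions. The contig, the exon coordinates, and the sequence name below are hypothetical; the coordinates are 0-based with an exclusive end, which is the convention Python uses for slicing.

```python
def build_mrna(contig, exons):
    """Join exon slices (0-based, end-exclusive) in order into one sequence."""
    previous_end = 0
    parts = []
    for start, end in exons:
        if start < previous_end or end <= start:
            raise ValueError("exons must be ordered and non-overlapping")
        parts.append(contig[start:end])
        previous_end = end
    return "".join(parts)


def to_fasta(name, seq, width=60):
    """Format a sequence as FASTA text with a fixed line width."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical contig and exon coordinates (lowercase marks made-up intron bases).
contig = "ttacgAGCCTTGGAAGGTCTCgtaagtttcagGCTGACTACgtatg"
exons = [(5, 21), (32, 41)]

mrna = build_mrna(contig, exons)
print(to_fasta("proposed_mRNA", mrna))
```

The resulting text begins with “>” and contains only exon bases, which is the form needed for the BLASTn search in this exercise.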
Exercise 4.5 Interpreting Results
The results from our example mRNA sequence are shown below.
When you did a BLASTn search before, your query sequence contained segments (introns) that
wouldn’t be found in mRNAs. Since this time you used a possible mRNA segment as a query, your
BLAST results should not show gaps. We can see that our proposed mRNA matches several subject
sequences along the entire length without any gaps. This is because the mRNA sequence should be
in a single reading frame from beginning to end. Your results may be similar or you may find breaks
in the sequence where a portion of sequence is missing or differs between plant species. Ultimately,
the DNA sequence of all the clones from a single species will be aligned with each other to further
check for accuracy. Next, you will need to examine your results in further detail.
1. Reformat the BLAST results, as described earlier in Exercise 4.2 so that BLAST results
show a query-anchored alignment from ten subject sequences.
2. Scroll through the aligned sequences and look for possible discrepancies. The query
sequence appears to contain four more bases than the reference mRNAs. Dashes indicate
missing bases. These discrepancies could be due to indels, improperly joined exons,
sequencing errors, or real differences between plant species.
3. Print out these results.
4. One way to check these discrepancies is to compare the sequence of your GAPDH clone
with that of other GAPDH clones from the same plant. We can do this since several of your
classmates worked with the same species. Email the instructor your final, corrected contig
so they can post it. Later you will compare them to each other (Exercise 6.1).
Review Questions
1. What kinds of sequences will we find in a genomic sequence: exons, introns, or both?
_______
2. What kinds of sequences will we find in an mRNA: exons, introns, or both? _______
Exercise 4.6 Translating the Predicted mRNA Sequence and Checking with BLASTx
Now we can predict the amino acid sequence encoded by your contig. BLASTx translates a nucleotide
sequence in all six possible reading frames (three for each strand) and compares
each of these putative amino acid sequences to a database of protein sequences published in
GenBank. This is helpful since we don’t know which reading frame is the correct one.
1. Go to the “NCBI BLAST” home page.
2. Select “BLASTx”.
3. Copy the sequence of your final mRNA model and paste it in the query box.
4. Choose “Non-redundant protein sequences (nr)” for the database.
5. Choose “plants” as the organism.
6. Click “BLAST”.
7. If your mRNA model is correct, you should see the following results:
• The alignment should span the entire length of your query sequence, from 0 to the
end.
• The entire sequence should be in a single reading frame, and there should not be any
gaps in the subject lines. If the top BLAST hit line is not continuous, then the submitted
contig should be re-evaluated because it still contains one or more base
insertions.
Review Questions
1. BLASTx translates a nucleotide sequence in six reading frames and then uses each one to
query a protein database. Why are there six possible reading frames?
2. In the BLASTx results the letters are no longer limited to A, G, C, or T. What do letters in
the BLASTx results represent?
Exercise 4.7 Translate your mRNA Sequence to Obtain the Predicted Sequence for your Protein
1. Find the sequence of your final mRNA in your folder and click the file name to download
and open the file.
2. Make sure that the mRNA is still in a FASTA format.
3. Select and copy the mRNA sequence.
4. Return to the iFinch home page.
5. Select “Sequence Utilities” from the home page.
6. A page from the “BCM Search Launcher” will appear. The “BCM Search Launcher” is
operated by the Baylor College of Medicine in Texas. There are several programs at this site
that you can use for manipulating and working with DNA sequences.
7. Paste your mRNA sequence in the text box. Make sure the sequence is preceded by “>”.
8. Select “6 Frame Translation” from the choices in the list. This will translate your mRNA
sequence in all six reading frames.
What is your correct reading frame? Determine this by comparing the output with the BLASTx
results in the previous section. Also, remember that the correct reading frame should contain no
internal stop codons.
_________
9. Copy the amino acid sequence that you think is the correct translation and use it as a query
with BLASTp (BLASTp uses a protein sequence to search a protein database) from the
NCBI BLAST site. Set the search parameter to provide 250 max target sequences. If you’ve
picked the correct sequence, it should match the correct protein. This
output can be used in the next exercise on phylogenetics.
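To see what a six-frame translation involves, here is a minimal Python sketch that uses the standard genetic code. It is an illustration only and is not the program run by the BCM Search Launcher; the frame labels (+1 to +3 and -1 to -3) are our own convention and may not match the labels in that tool's output. The example fragment is invented and is not a real GAPDH sequence.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate DNA codon by codon; '*' marks a stop and 'X' an unknown codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames["+" + str(offset + 1)] = translate(seq[offset:])
        frames["-" + str(offset + 1)] = translate(reverse[offset:])
    return frames


# Hypothetical fragment of an mRNA model.
fragment = "GCTAAGATCAAGGTTGGAATCAACGGT"
for label, protein in six_frames(fragment).items():
    print(label, protein, "stops:", protein.count("*"))
```

A frame with no stop codons across the whole sequence is a good candidate for the correct reading frame, which you can then confirm against your BLASTx results.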
Review Questions
1. Compare the results from the BLASTx search with the results from the BLASTp search.
How do the matches compare? What accounts for any discrepancies?
Exercise 5.1 Phylogenetics
On the BLASTp results page examine the link called “Distance Tree of Results”. Each accession
listed in this output is connected by a line to other accessions. The shorter the line the more similar
the sequence. Common ancestors are indicated where two or more lines converge. Therefore, this
tree shows evolutionary relationships.
The gray bullets in the output are mostly for predicted or unknown proteins, while the colored
bullets are better characterized. The species can be established by placing the cursor over the
colored bullets.
1. List three plant species that have sequences that are most closely related to your gene. To what
category do these closely related plant species belong?
2. List three plant species that have sequences that are least related to your gene. To what
category do these distantly related plant species belong?
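The idea that shorter lines join more similar sequences can be illustrated with the simplest possible distance measure, the proportion of aligned positions that differ (often called the p-distance). This Python sketch is an assumption-laden illustration: the NCBI tree is built with its own distance model, and the aligned fragments and labels below are invented placeholders, not real species or sequences.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned positions that differ, ignoring gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


# Made-up aligned protein fragments.
aligned = {
    "your_clone": "MAKIKIGINGFGRIGR",
    "relative_A": "MAKIKIGINGFGRIGR",
    "relative_B": "MGKIKVGINGFGRIGR",
    "relative_C": "MSKVRVGINGFG-IGR",
}

for name, seq in aligned.items():
    print(name, round(p_distance(aligned["your_clone"], seq), 3))
```

Sequences with the smallest distance to your clone would sit closest to it on a tree built from these values.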
Exercise 6.1 GAPDH Sequence Alignment
1. Go to the Dolan DNA Learning Center www.bioservers.org and open the BioServers website.
2. In the left-hand column, click “Sequence Server.” (If you want, you can register and save
your work.)
3. To create GAPDH sequences for comparison:
A. Click the “Create Sequence” button.
B. Copy the final contig sequence of your clone and paste it into the Sequence window.
Enter a name for the sequence (like “Joe Tomato”), and click “OK”. When it has been
saved you can hit the “Cancel” button to see it listed.
C. Repeat with all of the other final contigs for that same species of plant.
D. Check the boxes on the left of the sequences you want to compare.
E. Click the “Compare” button. Your sequences are sent to Cold Spring Harbor Laboratory,
NY, where they will be aligned using CLUSTALW.
F. A new window will appear with your results.
G. To view the entire gene, change the parameter so that all of the bases show on the page.
Click “Redraw.” Yellow highlighting denotes disagreements between sequences.
H. Print results, and mark discrepancies with a highlighter pen.
4. Are there discrepancies between the different clones? If so, what type of differences do you
see?
5. Why do you think there might be discrepancies between these different clones for the
GAPDH gene from the same plant? Give at least three reasonable explanations.
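The yellow highlighting in the alignment output marks columns where the clones disagree. The same check can be sketched in Python for sequences that are already aligned (the alignment itself is done by CLUSTALW on the server). The clone sequences below are invented for illustration, and the function name is our own.

```python
def discrepant_columns(aligned_seqs):
    """Return 1-based alignment columns where the sequences do not all agree."""
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return [
        column
        for column, bases in enumerate(zip(*aligned_seqs), start=1)
        if len(set(bases)) > 1
    ]


# Made-up aligned clone sequences; "-" marks an alignment gap.
clones = [
    "AGCCTTGGAAGGTCTC",
    "AGCCTTGGAAGGTCTC",
    "AGCCTTGCAAGG-CTC",
]
print(discrepant_columns(clones))
```

Each reported column is a position worth checking against the original traces before deciding which clone is correct.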
Notes for Instructor
Detailed instructions for media preparation, setting up PCR reactions, etc. are provided at
explorer.bio-rad.com.
See Appendices for examples of questions, discussion ideas, and for
supplementary ways to analyze results.
Materials
In addition to the Bio-Rad kit, the following equipment and supplies are needed for a lab section of
16 students (2 students per group; 8 groups per lab section):
Equipment
Thermal cycler: 1
Micropipettors (0.5-2 µl; 2-20 µl; 20-200 µl; 200-1000 µl): 8
37°C water bath or heating block: 1
37°C incubator: 1
Gel electrophoresis apparatus (32 wells): 1
Electrophoresis power supply: 1
Gel documentation system: 1
Microcentrifuge: 1
Microwave: 1
Computers: 8
-20°C freezer: 1
Balance: 1
Supplies
Filtered pipette tips (all sizes): 4 boxes
Tube racks: 4
Cold block or chipped ice: 4
Optional: Gels containing EtBr can be purchased from Bio-Rad
Safety precautions and disposal: Consult with your campus Chemical Safety Division for guidelines
for disposal of gels stained with ethidium bromide (EtBr) or SYBR Green. Recombinant bacteria
should be autoclaved or bleached before disposal.
Acknowledgements
Special thanks to Bellarmine University students Kathryne Blair, Catherine Brumm, Stephanie
Kortyka, Stephanie Mitchell, Melissa Pawley, Emily Whitledge and Sanda Zolj for their hours spent
performing this laboratory exercise and for their helpful suggestions. Thanks to all our past
Molecular Biology students at Bellarmine University who have been instrumental in helping us
develop this exercise.
The collaboration of the Joint Genome Institute (Department of Energy) in Walnut Creek, CA is
much appreciated.
Literature Cited
Altenberg, B., and Greulich, KO. 2005. Genes of glycolysis are ubiquitously overexpressed in 24
cancer classes. Genomics 84: 1014-1020.
Altschul, SF, Gish, W., Miller, W., Myers, EW and Lipman, DJ. 1990. Basic local alignment
search tool. Journal of Molecular Biology 215: 403-410.
Benson, D.A., I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. 2007. GenBank.
Nucleic Acids Research, 35: D21-D25.
Dennis, D.T. and S.D. Blakely. 2000. Carbohydrate Metabolism (Chapter 13). In: Biochemistry and
Molecular Biology of Plants (Eds. Buchanan, B.B., W. Gruissem, R.L. Jones), American Society of
Plant Physiologists, Rockville, MD.
Figge, R.M., Schubert, M., Brinkmann, H., and Cerff, R. 1999. Glyceraldehyde-3-phosphate
dehydrogenase gene diversity in eubacteria and eukaryotes: Evidence for intra- and inter-kingdom
gene transfer. Molecular Biology and Evolution 16: 429-440.
Kim, JW. and Dang, CV. 2005. Multifaceted role of glycolytic enzymes. Trends in Biochemical
Sciences 30: 142-150.
Olsen, K.W., D. Moras, and M.G. Rossmann. 1975. Sequence variability and structure of
D-glyceraldehyde-3-phosphate dehydrogenase. Journal of Biological Chemistry 250: 9313-9321.
Sanger, F., and A.R. Coulson. 1975. A rapid method for determining sequences in DNA by primed
synthesis with DNA polymerase. Journal of Molecular Biology 94: 441-448.
Sirover, M.A. 1999. New insights into an old protein: the functional diversity of mammalian
glyceraldehyde-3-phosphate dehydrogenase. Biochimica et Biophysica Acta 1432: 159-184.
Tatton, W.G., R.M. Chalmers-Redman, M. Elstner, W. Leesch, F.B. Jagodzinski, D.P. Stupak, M.M.
Sugrue, and N.A. Tatton. 2000. Glyceraldehyde-3-phosphate dehydrogenase in neurodegeneration
and apoptosis signaling. Journal of Neural Transmission, Supplementa, 60: 77-100.
Appendix A
Example of results from initial and nested PCR with student questions/discussion topics.
G+ = genomic DNA (Arabidopsis), positive control
P+ = plasmid DNA (Arabidopsis GAPDH), positive control
W- = water, negative control
I = initial PCR
N = nested PCR
Questions:
1) Why did we include positive controls in this experiment?
2) Why did we include a negative control in this experiment?
Questions 3 – 7 pertain to the positive control using genomic DNA (Arabidopsis):
3) How many bands are present after the initial PCR?
4) What are the approximate sizes of the different bands produced in the initial PCR?
5) Why are there multiple bands present after the initial PCR?
6) How many bands are present after the nested PCR?
7) Briefly explain the differences seen in the initial vs. the nested PCR reactions.
8) Describe the sizes and intensities of the bands produced in your initial vs. nested PCR reactions.
9) Explain the differences between your initial and nested PCR reactions.
Appendix B
Example of results from a restriction enzyme digest of recombinant plasmid with student questions.
MW ladder sizes: 10, 8, 6, 5, 4, 3, 2.5, 2, 1.5, 1, and 0.5 kb
Questions:
1) Describe the function of the Restriction Enzyme (R.E.) used in this experiment.
2) Using semi-log paper, estimate the sizes of the bands present in your lanes. How do these
compare with the sizes found by other students who are working with the same plant species?
Notes to instructor: additional analysis for class discussion:
1) Explain the results seen in Lane 4.
2) Do you think that the clone examined in Lane 7 is for GAPDH?
3) Do you see any lanes that show more than 2 bands?
4) What are some possible explanations for 3 or more bands?
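The semi-log estimate in Question 2 can also be sketched numerically. The following is a minimal Python sketch, not part of the original exercise: it fits a straight line to log10(size) versus migration distance for the ladder bands listed above and reads an unknown band off that line. The migration distances and the function names are hypothetical, and the straight-line assumption is only approximate for the largest fragments.

```python
import math

# Ladder sizes (kb) from Appendix B
LADDER_KB = [10, 8, 6, 5, 4, 3, 2.5, 2, 1.5, 1, 0.5]


def fit_semilog(distances_mm, sizes_kb):
    """Least-squares line: log10(size) = slope * distance + intercept."""
    ys = [math.log10(s) for s in sizes_kb]
    n = len(distances_mm)
    mean_x = sum(distances_mm) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in distances_mm)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances_mm, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_size_kb(distance_mm, slope, intercept):
    """Read a fragment size (kb) off the fitted line."""
    return 10 ** (slope * distance_mm + intercept)


# Hypothetical migration distances (mm), one per ladder band; measure your own gel.
ladder_mm = [12.0, 14.0, 16.5, 18.0, 20.0, 22.5, 24.0, 26.0, 28.5, 32.0, 38.0]
slope, intercept = fit_semilog(ladder_mm, LADDER_KB)
print(round(estimate_size_kb(21.0, slope, intercept), 2), "kb (estimate)")
```

Students can compare the value from the fitted line with the value read by eye from their semi-log plot.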
Appendix C
Example of a dot-plot for pair-wise comparison of two sequences. In this example, the two
sequences being aligned are the phrases “KEY LIME PIE” and “BIG MICKEY”. In the grid below,
an “X” was placed wherever the top and side phrases contained the same letter. The longest
diagonal formed by the contiguous “Xs” indicates the region of greatest identity, in this example
“KEY”. This is a simplified example of how a longer contig can be predicted from shorter
sequences that contain overlapping regions.
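(top: KEY LIME PIE; side: BIG MICKEY; spaces omitted)
   K E Y L I M E P I E
B  . . . . . . . . . .
I  . . . . X . . . X .
G  . . . . . . . . . .
M  . . . . . X . . . .
I  . . . . X . . . X .
C  . . . . . . . . . .
K  X . . . . . . . . .
E  . X . . . . X . . X
Y  . . X . . . . . . .
The final sequence (contig) in this example would be “BIG MICKEY LIME PIE”.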
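The dot-plot idea can be sketched in a few lines of Python. This is a simplified illustration only, not the method used by any assembly program: it marks matching letters, finds the longest diagonal run, and joins the two phrases on their overlap. The function names are hypothetical, and spaces are deliberately not counted as matches.

```python
def dot_plot(top, side):
    """Grid of booleans: True where a letter in `side` equals a letter in `top`."""
    return [[s == t and s != " " for t in top] for s in side]


def longest_diagonal(top, side):
    """Return the longest run of matches along a top-left to bottom-right diagonal."""
    grid = dot_plot(top, side)
    best = ""
    prev = [0] * (len(top) + 1)           # run lengths ending in the previous row
    for i, row in enumerate(grid):
        cur = [0] * (len(top) + 1)
        for j, hit in enumerate(row):
            if hit:
                cur[j + 1] = prev[j] + 1  # extend the diagonal from (i-1, j-1)
                if cur[j + 1] > len(best):
                    best = side[i - cur[j + 1] + 1 : i + 1]
        prev = cur
    return best


def merge(left, right):
    """Join two reads on the longest suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return left + right                   # no overlap found


top, side = "KEY LIME PIE", "BIG MICKEY"
for letter, row in zip(side, dot_plot(top, side)):
    print(letter, "".join("X" if hit else "." for hit in row))
print(longest_diagonal(top, side))
print(merge(side, top))
```

Real assemblers must also tolerate mismatches and gaps in the overlap, which this exact-match sketch does not attempt.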
Appendix D
Partial example of the output for a contig assembled with the “CAP3” assembly server at the
University of Lyon.
                            .    :    .    :    .    :    .    :    .    :    .    :
QCDP866765.b2_I19.ab+   AAATATTAATATTGACAAGTATTACGAGCATGAAAAACATAATACAAGAATGGGTGCAAA
QCDP866758.b2_K17.ab-   AAATATTAATATTGACAAGTATTACGAGCATGAAAAACATAATACAAGAATGGGTGCAAA
QCDP866715.b2_E07.ab-                                                     TGGGTGCAAA
                        ____________________________________________________________
consensus               AAATATTAATATTGACAAGTATTACGAGCATGAAAAACATAATACAAGAATGGGTGCAAA

                            .    :    .    :    .    :    .    :    .    :    .    :
QCDP866765.b2_I19.ab+   CTTACTGCAAAGCATATCACAAAAAAACAAACTGAGACAGCTTAAATTAAATGCTTAAGG
QCDP866758.b2_K17.ab-   CTTACTGCAAAGCATATCACAAAAAAACAAACTGAGACAGCTTAAATTAAATGCTTAAGG
QCDP866715.b2_E07.ab-   CT-ACTGCAA-GCATATCACAAAAAA-CAAACTGAGACAGCTAAATTAAATGCTTAAGG
                        ____________________________________________________________
consensus               CTTACTGCAAAGCATATCACAAAAAAACAAACTGAGACAGCTTAAATTAAATGCTTAAGG

                            .    :    .    :    .    :    .    :    .    :    .    :
QCDP866765.b2_I19.ab+   GGGTGCCATGTCCACGCACTGTTTTACCAAAGAATGAGAAAAGGTAACAGACAAATGGAC
QCDP866758.b2_K17.ab-   GGGTGCCATGTCCACGCACTGTTTTACCAAAGAATGAGAAAAGGTAACAGACAAATGGAC
QCDP866715.b2_E07.ab-   GGGTGCCATGTCCACGCACTGTTTTACCAAAGAATGAGAAAAGGTAACAGACAAATGGAC
                        ____________________________________________________________
consensus               GGGTGCCATGTCCACGCACTGTTTTACCAAAGAATGAGAAAAGGTAACAGACAAATGGAC

                            .    :    .    :    .    :    .    :    .    :    .    :
QCDP866765.b2_I19.ab+   ATGTAGCAATTACAGCATGAATACCTTGGCAGCACCAGTGCTGCTGGGAATGATGTTGAA
QCDP866758.b2_K17.ab-   ATGTAGCAATTACAGCATGAATACCTTGGCAGCACCAGTGCTGCTGGGAATGATGTTGAA
QCDP866715.b2_E07.ab-   ATGTAGCAATTACAGCATGAATACCTTGGCAGCACCAGTGCTGCTGGGAATGATGTTGAA
                        ____________________________________________________________
consensus               ATGTAGCAATTACAGCATGAATACCTTGGCAGCACCAGTGCTGCTGGGAATGATGTTGAA

                            .    :    .    :    .    :    .    :    .    :    .    :
QCDP866765.b2_I19.ab+   GCTAGCAGCCCTTCCACCTCTCCAGTCCTTGGAAGAGGGACCCATCACAGTTTTCTGGGT
QCDP866758.b2_K17.ab-   GCTAGCAGCCCTTCCACCTCTCCAGTCCTTGGAAGAGGGACCATCAACAGTTTTCTGGGT
QCDP866715.b2_E07.ab-   GCTAGCAGCCCTTCCACCTCTCCAGTCCTTGGAAGAGGGACCATCAACAGTTTTCTGGGT
QCDP866751.b2_M15.ab+                                              TCAACAGTTTTCTGGGT
                        ____________________________________________________________
consensus               GCTAGCAGCCCTTCCACCTCTCCAGTCCTTGGAAGAGGGACCATCAACAGTTTTCTGGGT

                            .    :    .    :    .    :    .    :    .    :    .    :
QCDP866765.b2_I19.ab+   AGCTGAAAAAAAAAATGC
QCDP866758.b2_K17.ab-   AGCTGAAAAAAAAAATGCAAAATCCCATGTAAATAAGCATAGCCTTGCATTAAAGTACTT
QCDP866715.b2_E07.ab-   AGCTGAAAAAAAAAATGCAAAATCCCATGTAAATAAGCATAGCCTTGCATTAAAGTACTT
QCDP866751.b2_M15.ab+   AGCTGAAAAAAAAATGCAAAATCCCATGTAAATAAGCATAGCCTTGCATTAAAGTACTT
                        ____________________________________________________________
consensus               AGCTGAAAAAAAAAATGCAAAATCCCATGTAAATAAGCATAGCCTTGCATTAAAGTACTT

                            .    :    .    :    .    :    .    :    .    :    .    :
QCDP866758.b2_K17.ab-   AT-ACACACCATG
QCDP866715.b2_E07.ab-   ATTACACACCATGTTGTTTTAAGTCAACAAAATCATCAAATACCAGTGATAGAGTGCACG
QCDP866751.b2_M15.ab+   ATTACACACCATGTTGTTTTAAGTCAACAAAATCATCAAATACCAGTGATAGAGTGCACG
                        ____________________________________________________________
consensus               ATTACACACCATGTTGTTTTAAGTCAACAAAATCATCAAATACCAGTGATAGAGTGCACG
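To show how a consensus line can arise from disagreeing reads, here is a minimal Python sketch that takes a simple majority vote in each alignment column. It is an illustration under stated assumptions, not CAP3's actual algorithm (which is not described in this appendix); the function name is hypothetical. The example rows are the last 20 columns of the fifth alignment block above, where one read disagrees with the others.

```python
from collections import Counter


def majority_consensus(rows):
    """Majority vote per column; spaces mean 'read does not cover this column'."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    consensus = []
    for column in zip(*padded):
        votes = Counter(base for base in column if base != " ")
        if not votes:
            continue                      # no read covers this column
        base = votes.most_common(1)[0][0]
        if base != "-":                   # a winning gap adds nothing to the consensus
            consensus.append(base)
    return "".join(consensus)


reads = [
    "CCCATCACAGTTTTCTGGGT",  # QCDP866765.b2_I19.ab+
    "CCATCAACAGTTTTCTGGGT",  # QCDP866758.b2_K17.ab-
    "CCATCAACAGTTTTCTGGGT",  # QCDP866715.b2_E07.ab-
    "   TCAACAGTTTTCTGGGT",  # QCDP866751.b2_M15.ab+ (read starts mid-block)
]
print(majority_consensus(reads))
```

Columns where the reads disagree are good places to return to the chromatograms and check base calls by eye.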
Appendix E
An additional method to confirm introns and exons is the “NetPlantGene Server”, provided by the
Technical University of Denmark. The “NetPlantGene Server” is an artificial neural network
designed to predict intron splice sites in Arabidopsis (Hebsgaard et al., 1996)*. It is freely available
on the web.
1. Go to the “NetPlantGene Server” at http://www.cbs.dtu.dk/services/NetPGene/
2. Paste your final contig sequence into the sequence box.
3. Hit “Submit Sequence”.
4. Below is a partial example of an output. It shows the locations of the predicted intron/exon
splice sites and the relative confidence levels for those predictions.
Length: 1365 nucleotides.
24.1% A, 20.1% C, 19.9% G, 36.0% T, 0.0% X, 39.9% G+C

Donor splice sites, direct strand
---------------------------------
pos 5'->3'   phase   strand   confidence   5' exon intron 3'
293          0       +        1.00         CCTTGCCAAG^GTAATTCTTG H
470          1       +        0.96         TCTATCACTG^GTATTTGATG
669          0       +        1.00         TGCTGCCAAG^GTATTCATGC H
1036         2       +        1.00         CTGCCATCAA^GTGAGTTATC H
1204         2       +        0.96         GTGACAGCAG^GTACCTTCAC

Acceptor splice sites, direct strand
------------------------------------
pos 5'->3'   phase   strand   confidence   5' intron exon 3'
145          0       +        0.96         TTTGAAATAG^GGCGGTGCCA
408          0       +        0.87         ATATTTGCAG^GTCATTAATG
570          1       +        0.96         TTTTTTTCAG^CTACCCAGAA
734          0       +        0.78         GGTAAAACAG^TGCGTGGACA
892          0       +        1.00         CTCCTTGCAG^GCTGTTGGAA H
1119         2       +        1.00         GCCTTTTCAG^GGCCGAGTCT H
1336                 +        0.00         TTAAAAACAG^GTCTAGCATC
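The predictions above can be screened with a short script. The following Python sketch is illustrative only: it checks each predicted site against the canonical GT...AG intron boundaries and lists sites above a confidence cutoff. The 0.9 cutoff and the function names are assumptions chosen for this example, not values recommended by the server.

```python
# (position, confidence, sequence context) copied from the example output above
DONORS = [
    (293, 1.00, "CCTTGCCAAG^GTAATTCTTG"),
    (470, 0.96, "TCTATCACTG^GTATTTGATG"),
    (669, 1.00, "TGCTGCCAAG^GTATTCATGC"),
    (1036, 1.00, "CTGCCATCAA^GTGAGTTATC"),
    (1204, 0.96, "GTGACAGCAG^GTACCTTCAC"),
]
ACCEPTORS = [
    (145, 0.96, "TTTGAAATAG^GGCGGTGCCA"),
    (408, 0.87, "ATATTTGCAG^GTCATTAATG"),
    (570, 0.96, "TTTTTTTCAG^CTACCCAGAA"),
    (734, 0.78, "GGTAAAACAG^TGCGTGGACA"),
    (892, 1.00, "CTCCTTGCAG^GCTGTTGGAA"),
    (1119, 1.00, "GCCTTTTCAG^GGCCGAGTCT"),
    (1336, 0.00, "TTAAAAACAG^GTCTAGCATC"),
]


def is_canonical(context, kind):
    """Donor introns start with GT after '^'; acceptor introns end with AG before '^'."""
    left, right = context.split("^")
    return right.startswith("GT") if kind == "donor" else left.endswith("AG")


def confident(sites, cutoff=0.9):
    """Positions whose confidence meets the (hypothetical) cutoff."""
    return [pos for pos, conf, _ in sites if conf >= cutoff]


print("donors    >= 0.9:", confident(DONORS))
print("acceptors >= 0.9:", confident(ACCEPTORS))
print("all canonical:",
      all(is_canonical(c, "donor") for _, _, c in DONORS)
      and all(is_canonical(c, "acceptor") for _, _, c in ACCEPTORS))
```

Sites that pass such a screen can then be compared with the exon boundaries suggested by the BLAST alignments.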
*Hebsgaard, S.M., P.G. Korning, N. Tolstrup, J. Engelbrecht, P. Rouze, and S. Brunak. 1996. Splice
site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence
information. Nucleic Acids Research 24:3439-3452.
Appendix F
Intron/exon locations can be examined further by studying genomic GAPDH sequences that have
already been published in GenBank. The mRNA feature for these accessions lists the base-pair
locations of each exon, and these may be similar in homologous sequences. A partial rice sequence
is shown below, but you would use the GenBank sequence that is most similar to the GAPDH
sequence you are studying (as revealed by BLAST). In this example, the exons are the ranges listed
in the join(...) statement of the mRNA feature (underlined in the original printed sequence).
LOCUS       NC_008397               3602 bp    DNA     linear   PLN 19-FEB-2008
DEFINITION  Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 4.
ACCESSION   NC_008397 REGION: 24280691..24284292
VERSION     NC_008397.1 GI:115461545
SOURCE      Oryza sativa Japonica Group
  ORGANISM  Oryza sativa Japonica Group
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP
            clade; Ehrhartoideae; Oryzeae; Oryza.
  AUTHORS   Ohyanagi,H., Tanaka,T., Sakai,H., Shigemoto,Y., Yamaguchi,K.,
            Habara,T., Fujii,Y., Antonio,B.A., Nagamura,Y., Imanishi,T.,
            Ikeo,K., Itoh,T., Gojobori,T. and Sasaki,T.
  TITLE     The Rice Annotation Project Database (RAP-DB): hub for Oryza sativ
            ssp.
  JOURNAL   Nucleic Acids Res. 34 (DATABASE ISSUE), D741-D744 (2006)
FEATURES             Location/Qualifiers
     source          1..3602
                     /organism="Oryza sativa Japonica Group"
                     /mol_type="genomic DNA"
                     /cultivar="Nipponbare"
                     /db_xref="taxon:39947"
                     /chromosome="4"
     gene            1..3602
                     /gene="Os04g0486600"
                     /db_xref="GeneID:4336216"
     mRNA            join(1..86,216..239,333..433,977..1092,1179..1278,1378..1524,
                     1639..1699,1790..1887,2428..2570,2684..2767,2848..2937,
                     3328..3602)
                     /db_xref="GeneID:4336216"
                     /note="Glyceraldehyde-3-phosphate dehydrogenase, cytosolic

     1141 agtataccat atcgctcaac cttaatttct ctgctcagga accctgagga gatcccatgg
     1201 ggtgagactg gcgctgagtt tgttgtggag tccactggtg ttttcactga caaggacaag
     1261 gccgctgctc acctgaaggt attttctgcg aaacccaaca tgtcattgtt tgattgtggt
     1321 gtcataacaa ttcatatgaa gaatgattct catggtatca tatgcgtttc aatatagggt
     1381 ggtgctaaga aggtcgtcat ctctgctccc agcaaggatg cccccatgtt tgttgttggt
     1441 gtcaatgaga aggagtacaa gcctgacatc gacattgtgt ccaatgctag ctgcaccacc
     1501 aactgccttg ctccacttgc caaggtgtgt tcctcactca ttttcacttg tctactgttg
     1561 caaattgtgt ggtgatgtta agaaattggt cactactacc atatgaagcc ttgtccttat
     1621 caaattgtct gcctgcaggt tatcaatgac aggtttggta ttgttgaggg tttgatgacc
     1681 actgtccatg caatcactgg tatgatcttt gaagtacctg tgatgttctt cattgttgtg
     1741 gcgttgctta gcatttcttt gatgtaacac tgtgctgtaa ttcttacagc aactcagaag
     1801 accgttgatg gaccctcgag caaggactgg aggggtggaa gggctgccag tttcaacatc
     1861 atccctagca gcactggagc tgcaaaggta atcatgagat ttaatattat cccatatgct
     1921 gaggtcagcc tccatgatga tttctgtctg tacatataat ataggtgcat tagttagtaa
     1981 tgattagtcg ctcacttaag agtcaactct ttgttaactt cactcgagtt ggtaggctca
     2041 acctttgtag catcaacaat gatttgttaa agatctcttt gaatgcatta gcatttggtg
     2101 tacatgtcag acaaaatttt taaacacgtt tacttataat tgaagttaaa gctactgact
     2161 ccctatcaga gaaggttcaa agtacattgt cctgaagcaa ctgtactaat ttgaacacag
     2221 tttctgcctc caaatgcttc tttgtttgca tcatgtttgt taccttcagc ataatctctg
     2281 gaagtgcaac cgtatgcatc taagttgtaa attcaaaatt tgagatcatt atttaagaat
     2341 tttgtttata gtcttttcgc ctttgtgcta atcctgtttc ggggaaatgt ttgtacatgt
     2401 tttaactttt gctataaaac ttgttaggct gtcggcaagg tgcttcctgc tctcaatgga
     2461 aagttgactg gtatggcttt ccgtgttcca accgtggatg tctctgtcgt tgatttgact
     2521 gtcaggcttg agaagccagc atcctatgat cagattaagg cagctatcaa gtaagtgtag
     2581 cataaattag aattgccctt acagaaagtg aagaggctaa agtattgggt accagaacgt
     2641 catctaagac tagcaattga cttttgtttg ttctgttttc tagggaggag tctgagggga
     2701 aactcaaggg tattctgggt tacgttgagg aggaccttgt ttccacagac ttccagggtg
     2761 acaacaggtg gagatctgtt taaattcgtc ttgtatttgg atactcggtt actcgaagcc
     2821 cctgacctct tttgctctgt gtaacaggtc aagcatcttt gatgcaaagg ccggtattgc
     2881 tctcaatgac aactttgtta agcttgtgtc ttggtatgac aacgaatggg gatacaggta
     2941 atcatttcta tgtgcattag aaagctatac ttgtttttta gttcatggtt gtttggttca
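The exon coordinates in the mRNA feature can be turned into a list of exons and introns with a few lines of Python. This is a minimal sketch for illustration, assuming the location is a simple join(...) of start..end ranges on the direct strand, as in the rice record above; the function names are hypothetical.

```python
import re

# mRNA location copied from the rice GAPDH record above
MRNA_LOCATION = (
    "join(1..86,216..239,333..433,977..1092,1179..1278,1378..1524,"
    "1639..1699,1790..1887,2428..2570,2684..2767,2848..2937,3328..3602)"
)


def parse_exons(location):
    """Extract (start, end) pairs from a simple join(a..b,c..d,...) location."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def introns_between(exons):
    """Introns are the gaps between consecutive exons."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


exons = parse_exons(MRNA_LOCATION)
introns = introns_between(exons)
print(len(exons), "exons;", sum(end - start + 1 for start, end in exons), "bp of exon sequence")
for start, end in introns:
    print(f"intron {start}..{end} ({end - start + 1} bp)")
```

The intron boundaries printed this way can be checked against the sequence listing: each intron shown above begins with "gt" and ends with "ag".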