TGAC * Sequence Polymorphisms Module

Blue Line Walkthrough
A. Examining DNA Sequence
Example Sequences:
rbcL sample 1
Tool(s):
Sequence Viewer
Concept(s):
DNA Barcoding, Sanger DNA Sequencing
DNA Barcoding: The process of
species identification by
examination of DNA Sequence.
rbcL: A gene coding the large
subunit of the enzyme RuBisCo,
and one of the important loci
for species identification of
plants.
Sanger DNA Sequencing: A
method of DNA sequencing
that uses fluorescently labeled
dideoxynucleotide terminators
to generate the sequence of a
DNA sample.
Quality (Phred) Score:
Nucleotide calls read from
sequencing output files are
assigned a quality score of 10,
20, 30, 40, or 50. A score of 50
means that the base is called
with 99.999% accuracy. A
score of 20 (99% accuracy) is
the cut-off for high-quality
sequence; bases scoring below
20 are considered low quality.
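Illustration: the accuracy figures quoted for quality scores follow from the standard Phred definition, Q = -10 * log10(P), where P is the probability that a base call is wrong. The short Python sketch below converts a score into an accuracy percentage. The function names are hypothetical and chosen only for this example; they are not part of DNA Subway.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong (P = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def phred_to_accuracy_percent(q: float) -> float:
    """Accuracy of a base call, in percent, for Phred score q."""
    return (1 - phred_to_error_probability(q)) * 100


# Print the accuracy for the scores mentioned in the glossary entry.
for q in (10, 20, 30, 40, 50):
    print(q, f"{phred_to_accuracy_percent(q):.3f}%")
```

Each step of 10 in the score corresponds to a tenfold drop in the chance of a wrong base call.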
I. Create Project
1. Log in to DNA Subway (dnasubway.iplantcollaborative.org)
2. Click ‘Determine Sequence Relationships.’ (Blue Square)
3. Select project type ‘Barcoding: rbcL.’
4. Select sample sequence ‘rbcL sample 1.’
5. Provide your project with a title, then Click ‘Continue.’ Alternatively, if
you have sequenced your DNA using your Genewiz account, Select
‘Import trace files from DNALC,’ then select the sequences to import.
II. View Sequence
6. Click ‘Sequence Viewer’ to show a list of your sequences.
7. Click on a sequence name to show that sequence’s trace file.
Questions:
Q.1:
What do you notice about the electropherogram peaks and quality scores at nucleotide positions
labeled “N”?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
Q.2:
Where do the ‘N’s in the sequence tend to be distributed, and why?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
Additional Investigation: Learn more about Sanger sequencing at: http://www.dnalc.org/view/15479-Sanger-method-of-DNAsequencing-3D-animation-with-narration.html
B. Assembling and Editing DNA Sequence
Example Sequences:
rbcL sample 1 from Part A
Tool(s):
Sequence Trimmer, Pair Builder, Consensus Builder
Concept(s):
Sanger DNA Sequencing, bidirectional reads
Bidirectional sequence:
DNA sequence generated by
sequencing a DNA strand in the
forward and reverse
orientation.
Consensus sequence: A
single sequence that
summarizes the agreement
between two or more DNA
sequences.
I. Trim 5’/3’ ends
1. Click ‘Sequence Trimmer.’
2. Click ‘Sequence Trimmer’ again to examine the changes made in the
sequence.
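Illustration: one simple way to think about end trimming is to drop bases from each end of a read until a base with an acceptable quality score is reached. The Python sketch below does exactly that. The rule and the cut-off of 20 are assumptions made for this example; this is not a description of how the Sequence Trimmer in DNA Subway actually works, and the function name is hypothetical.

```python
def trim_low_quality_ends(bases, quals, min_q=20):
    """Remove bases from both ends whose quality is below min_q.

    bases: string of base calls (e.g. "NNACGTAN")
    quals: list of quality scores, one per base
    Low-quality bases in the interior of the read are kept.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    start, end = 0, len(bases)
    # Walk inward from the 5' end.
    while start < end and quals[start] < min_q:
        start += 1
    # Walk inward from the 3' end.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


trimmed, trimmed_quals = trim_low_quality_ends(
    "NNACGTAN", [5, 8, 30, 40, 12, 35, 30, 9]
)
print(trimmed, trimmed_quals)
```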
II. Pair Builder
1. Click ‘Pair Builder.’
2. Select the check boxes next to the sequences that represent
bidirectional reads of the same sequence set. Alternatively, Select the ‘Try
Auto Pairing’ function and verify the pairs generated.
3. As necessary, Reverse Complement sequences that were sequenced in
the reverse orientation by clicking the ‘F’ next to the sequence name. The
‘F’ will become an ‘R’ to indicate the sequence has been reverse
complemented.
4. Save the created pairs.
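Illustration: reverse complementing a read means replacing each base with its pairing partner (A with T, C with G) and then reading the result backwards, so that a reverse read can be compared directly with its forward partner. A minimal Python sketch follows; the function name is hypothetical, and ‘N’ (an uncalled base) is left as ‘N’.

```python
# Translation table mapping each base to its complement; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCN"))
```

Applying the function twice returns the original sequence, which is a quick way to check it.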
III. Consensus Builder
1. Click ‘Consensus Builder’
2. Click ‘Consensus Builder’ again to examine the created consensus files.
Any differences between two reads will be highlighted in yellow in the
consensus builder.
3. Make needed edits, and Save your changes.
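Illustration: the Python sketch below shows one possible way to combine two reads into a consensus. It assumes the two reads are already aligned, are the same length, and are in the same orientation. Where they disagree, it keeps the base with the higher quality score and writes ‘N’ when the scores tie. This rule and the function name are assumptions for the example; in DNA Subway the edits at highlighted positions are made by you.

```python
def simple_consensus(read_a, qual_a, read_b, qual_b):
    """Build a consensus from two aligned reads of equal length.

    Returns the consensus string and the list of positions (0-based)
    where the two reads disagree.
    """
    if not (len(read_a) == len(read_b) == len(qual_a) == len(qual_b)):
        raise ValueError("reads and quality lists must all be the same length")
    consensus = []
    mismatches = []
    for i, (a, b) in enumerate(zip(read_a, read_b)):
        if a == b:
            consensus.append(a)
            continue
        mismatches.append(i)
        if qual_a[i] > qual_b[i]:
            consensus.append(a)
        elif qual_b[i] > qual_a[i]:
            consensus.append(b)
        else:
            consensus.append("N")  # no basis for choosing either base
    return "".join(consensus), mismatches


seq, diffs = simple_consensus(
    "ACGTN", [40, 40, 40, 10, 5],
    "ACGAA", [40, 40, 40, 40, 5],
)
print(seq, diffs)
```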
Questions:
Q.3:
DNA Subway marks sequence that it identifies as low quality with a symbol. What problems might arise
from generating a consensus sequence from low-quality DNA sequence?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
C. Matching Sequence to Databases
Example Sequences:
rbcL sample 1 from Part B
Tool(s):
BLAST, Upload Data, Reference Data
Concept(s):
BLAST Searches, GenBank, BOLD Database
BLAST: The Basic Local Alignment
Search Tool (BLAST) is an
algorithm that searches
databases of biological
sequence information (e.g.
DNA, RNA, or protein
sequence) and returns matches.
The BLASTN program is specific
to nucleotide data.
GenBank: The largest database
of publicly available nucleotide
sequences. As of 2011 the
database contains well over
100 billion nucleotides of
generated sequence data.
BOLD: Barcode of Life Online
Database (BOLD) is an online
repository for sequence data
generated by DNA barcoding
projects worldwide.
I. Check for matches in GenBank
1. Click ‘BLASTN.’
2. Click the ‘BLAST’ link for the sequence of interest.
3. Examine the BLAST matches for candidate identification. Clicking the
species name given in the BLAST hit will also give additional
information/photos of the listed species.
4. If desired, select the check box next to any hit, and select ‘Add BLAST
hits to project’ to add selected sequences to your project.
II. Upload Data (optional)
1. If desired, Click ‘Upload Data’ to import additional data into your
project. You will need to repeat steps in the ‘Assemble Sequences’ stop
on DNA Subway.
III. Reference Data (optional)
1. Click ‘Reference Data.’
2. Select one or more groups of sequences from selected reference
samples of rbcL sequence.
3. Select ‘Add ref data’ to add the data to your project.
Questions:
Q.4:
BLAST will return the closest matches present in GenBank. Will you be able to identify an unknown
species using BLAST alone? Why or why not?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
Additional Investigation: See the laboratory: “Using Barcoding to identify and Classify Living Things.”
(http://www.urbanbarcodeproject.org/files/Barcoding_Protocol.pdf)
D. Building Phylogenetic Trees
Example Sequences:
rbcL sample 1 from Part C
Tool(s):
Select Data, MUSCLE, PHYLIP NJ, PHYLIP ML
Concept(s):
Sequence alignment, phylogenetics
Multiple Alignment: A (usually)
computer-generated alignment
of multiple sequences. Under the
assumption that all sequences
within the alignment are
similar (e.g. of a common
genetic origin, from a common
locus, in the same strand
orientation), gaps are
introduced where
misalignments (e.g. insertions,
deletions, or missing data)
appear.
I. Select Data for Alignment
1. Click ‘Select Data.’
2. Select any and all sequences you wish to add to your tree.
3. Click ‘Save Selections.’
II. Generate Multiple Sequence Alignment
1. Click ‘MUSCLE.’
2. Click ‘MUSCLE’ again to open the sequence alignment window.
3. Click ‘Trim Alignment.’
4. Examine the alignment to help answer question 5.
Phylogenetic tree: A diagram
which represents inferred
evolutionary relationships
between organisms. As applied
here, sequences are displayed
with branch lengths that are
proportional to the differences
between the sequences.
PHYLIP NJ and PHYLIP ML: Tree
building algorithms based on
the “Neighbor Joining” and
“Maximum Likelihood” methods
respectively. See:
http://www.icp.be/~opperd/private/neighbor.html and
http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
III. Construct Phylogenetic Tree
1. Click either ‘PHYLIP NJ’ or ‘PHYLIP ML’ to run the tree construction
algorithm.
2. Click the button for the algorithm you chose above again to launch a
viewer for the multiple alignment and tree.
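Illustration: tree-building methods such as Neighbor Joining start from a measure of how different each pair of aligned sequences is. One of the simplest measures is the p-distance: the fraction of compared positions at which two sequences differ. The Python sketch below computes it for every pair in a small made-up alignment. The sequences, the names, and the choice to skip gap and ‘N’ positions are assumptions for the example, and this is not necessarily the distance measure that PHYLIP uses.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of compared positions at which two aligned sequences differ.

    Positions where either sequence has a gap ('-') or an 'N' are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


def distance_matrix(alignment: dict) -> dict:
    """p-distance for every pair of named sequences in the alignment."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Made-up alignment, for illustration only.
example = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "AC-TTCGA"}
for pair, d in distance_matrix(example).items():
    print(pair, round(d, 3))
```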
Questions:
Q.5:
What relationship do you see between sequences that have more mutations (align less well with the
majority of sequences) in the alignment and the length of a sequence’s branch on the tree?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
Q.6:
Do you see differences between the phylogenetic trees generated by the Neighbor Joining and Maximum
Likelihood methods?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
Biological Concepts
Genomes
• A genome is an organism’s entire complement of DNA.
• DNA is a directional molecule composed of two anti-parallel strands.
• The genetic code is read in a 5’ to 3’ direction, referring to the 5’ and 3’ carbons of deoxyribose.
• Eukaryotic genomes contain large amounts of repetitive DNA, including simple repeats and transposons.
• Transposons can be located in intergenic regions (between genes) or in introns (within genes).
• Genes and transposons are directional, and can be encoded on either DNA strand.
• Repeats are non-directional, and, in effect, do occur on both strands.
• Transposons can mutate like any other DNA sequence.
Genes
• Protein-coding information in DNA and RNA begins with a start codon, is followed by codons, and ends with a stop codon.
• Codons in mRNA (5’-AUG-3’, etc.) have sequence equivalents in DNA (5’-ATG-3’, etc.).
• The DNA strand that is equivalent to mRNA is called the “coding strand.” The complementary strand is called the “template strand,” because it serves as the template for synthesizing mRNA.
• Non-spliced genes, which are characteristic of prokaryotes, are also found in eukaryotes.
• Even in a spliced gene, the protein-coding information may be organized as an Open Reading Frame (ORF).
• Most eukaryotic genes are spliced, whereby intervening segments (introns) are removed and the remaining segments (exons) are spliced together.
• Splice sites (exon-intron boundaries) have sequence patterns that are recognized by the splicing apparatus (spliceosome).
• Gene prediction programs use consensus sequences around splice sites to predict exon-intron boundaries.
• Over 90% of eukaryotic introns have “canonical splice sites,” whereby introns begin with GT (mRNA: GU) and end in AG (mRNA: AG).
• The protein coding sequence of a eukaryotic mRNA (or gene) is flanked by 5’- and 3’-untranslated regions (UTRs); introns can be located in UTRs.
• In most eukaryotic genes, transcripts are alternatively spliced, yielding different mRNAs and proteins.
• UTRs hold information for the half-lives of mRNAs and for regulatory purposes.
• Gene > mRNA > CDS.
• CDS = nucleotides that encode amino acid sequence.
• In mRNA: CDS = ORF.
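Illustration: two of the points above can be checked with a few lines of Python. The first function looks for an open reading frame on the coding strand: an ATG start codon followed, in the same reading frame, by a stop codon (TAA, TAG, or TGA in the standard genetic code). The second tests whether an intron has canonical splice sites, beginning with GT and ending in AG. Both function names and the test sequences are made up for this example; the sketch reads one strand in the 5’ to 3’ direction only and is far simpler than a real gene prediction program.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA form


def find_first_orf(dna: str):
    """Return the first ATG...stop reading frame found, or None.

    Only the given strand is scanned, in the 5' to 3' direction.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Step through codons in frame with this start codon.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        # No in-frame stop codon: try the next ATG.
        start = dna.find("ATG", start + 1)
    return None


def has_canonical_splice_sites(intron: str) -> bool:
    """True if the intron begins with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


print(find_first_orf("CCATGGCTTGATAA"))
print(has_canonical_splice_sites("GTAAGTTTTCAG"))
```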
BLAST Searches
• Basic Local Alignment Search Tool (BLAST) searches databases for matches to a query DNA or protein sequence.
• Gene or protein homologs share sequence similarities due to descent from a common ancestor.
• Biological evidence is needed to edit and confirm gene models predicted by computer algorithms.
• Biological evidence is most often derived from mRNA transcripts (ESTs, cDNAs, RNAseq). Protein sequence data are available, too, but much less common.
• Many ESTs and cDNAs are disrupted by “introns” when they are aligned against genomic DNA.
• ESTs & cDNAs may be incomplete.
• The BLAST algorithm does not resolve intron/exon boundaries.
• The BLAST algorithm is not restricted to detecting sequences that fully match a query (“global” matches) but, instead, matches query subsequences as well (“local” matches).
• The BLAST algorithm matches sequences to the fullest extent possible and, often, realigns the same sequence twice.
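Illustration: the idea of a “local” match can be shown with the Smith-Waterman method, which finds the best-scoring stretch shared by two sequences even when the rest of the sequences do not match. The Python sketch below returns only the best local score. It is not the BLAST algorithm itself, which uses a faster heuristic search, and the scoring values (match +2, mismatch -1, gap -2) are arbitrary choices for this example.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (Smith-Waterman).

    Scores never drop below zero, so a poor region does not penalize
    a good match found elsewhere in the sequences.
    """
    cols = len(subject) + 1
    prev = [0] * cols  # scores for the previous row of the table
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            step = match if query[i - 1] == subject[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + step, # extend with a match or mismatch
                prev[j] + gap,      # gap in the subject
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The query matches a stretch in the middle of the longer sequence.
print(local_alignment_score("ACGT", "TTACGTTT"))
```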
Web Resources for Genome Annotation
A. Major Plant Genome Hubs:
DOE JGI’s Phytozome: http://www.phytozome.net
University of Iowa: http://www.plantgdb.org/
CSHL: http://www.gramene.org/
ENSEMBL: http://plants.ensembl.org/index.html
NCBI: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
NCBI: http://www.ncbi.nlm.nih.gov/mapview/
B. Some Plant Genome Portals:
Arabidopsis, TAIR: http://www.arabidopsis.org/
Corn: http://www.maizesequence.org/index.html
Grape: http://www.cns.fr/externe/GenomeBrowser/Vitis/
Poplar: http://genome.jgi-psf.org/poplar/poplar.home.html
Rice: http://rice.plantbiology.msu.edu/
Tomato: http://solgenomics.net/about/tomato_sequencing.pl
C. Browsers:
Ensembl: http://www.ensembl.org
GBrowse: http://gmod.org/wiki/GBrowse
JBrowse: http://jbrowse.org/
UCSC Browser: http://genome.ucsc.edu
xGDB: http://brendelgroup.org/bioinformatics2go/bioinformatics2go.php
D. Other Resources:
Course download site: http://gfx.dnalc.org/files/evidence
DynamicGene: http://www.sanger.ac.uk/resources/software/artemis/
GeneBoy: http://www.dnai.org/geneboy/
BioServers: http://www.bioservers.org/bioserver/
mRNA/gDNA: http://www.ncbi.nlm.nih.gov/spidey/
mRNA/gDNA: http://pbil.univ-lyon1.fr/sim4.php
Splice site predictor: http://www.fruitfly.org/seq_tools/splice.html
Promoter predictor: http://www.fruitfly.org/seq_tools/promoter.html