Uploaded by jordan lawson

DNA Barcoding - Subway Handout student (2)

Week 3: Analysis of Sequencing Data on the DNA Subway
Goals:
● Use bioinformatic tools in the DNA Subway Blue Line to:
○ Visualize and edit your DNA sequences to create a DNA barcode
○ Search a national database of DNA sequences for matches to your DNA barcode
○ Create phylogenetic trees to help determine the taxonomic identity of your
specimen
Today we will use a website called DNA Subway to analyze our DNA sequences and attempt to
identify the specimen(s) we collected. Recall the steps that have led us up to this point:
● We found organisms, photographed them, and collected their tissue
● We extracted DNA from those tissue samples
● We amplified specific regions of that DNA extract using specific DNA barcoding primers
and the Polymerase Chain Reaction (PCR)
● We used gel electrophoresis to determine whether or not our PCR reactions were
successful
● We shipped off successfully amplified DNA samples to be sequenced (a forward and
reverse read for each sample)
Now we need to determine:
● Whether or not the DNA sequences are “accurate”
● Whether or not the DNA sequences are similar to those of other organisms
● Just how similar the DNA sequences are to other organisms’ DNA sequences
● Whether or not we can use the DNA sequences to help determine your specimen’s
taxonomic identity
We could do this by hand, by printing out our DNA sequences, lining them up with all the other
DNA sequences that are out there and trying to visualize which DNA sequences are most
similar to each other. However, this is extremely difficult when your DNA sequences are several
hundred base pairs long and there are literally millions of other DNA sequences for comparison.
So how do we make sense of all these complex data? Scientists in an emerging field of science
called bioinformatics have developed many of the tools we need to conduct these analyses.
Bioinformatics is an interdisciplinary field of science that combines computer science, statistics,
mathematics, and engineering to analyze and interpret complex biological data, like DNA
sequences.
Not surprisingly, bioinformatic techniques are often complex and very difficult for novices to use.
Fortunately for us, a website called DNA Subway (https://dnasubway.cyverse.org/) has
conveniently combined a number of complex bioinformatic tools into a simple, user-friendly way
to analyze DNA barcode sequences. Today we will learn how to use the “Blue Line” of the DNA
Subway to analyze our DNA sequences and in the process become novice bioinformaticians.
Overview of Today’s Activities:
1) Your instructor will navigate you step-by-step through the DNA Subway using a
presentation
2) After the practice round, you will work individually to analyze your own DNA sequences
using DNA Subway
3) You will record the results of your analysis in your lab notebook
4) You will use the results of your analysis to complete a Postlab worksheet
Procedures for Activity 2: Analyzing your DNA sequence(s)
1. Google “DNA Subway.” Navigate to the DNA Subway website. Log in.
2. Once you are logged in, click the blue square “Determining Sequence Relationships” on
the DNA Subway homepage to create a new Blue Line project.
3. Create your project
a. Select the proper barcoding project type for your specimen (ants are invertebrates)
i. rbcL = plant; COI = invertebrate, fish, or other animal; 16S = bacteria; ITS = fungi
b. Create a project title
i. Last name_Specimen type (plant, fungi, fish, ant, etc.)_BIO102_Sec#
1. Example: Smith_plant_BIO102_Sec10
c. Provide a project description
i. Include info that will help you remember what this project was about
1. Ex. DNA barcodes extracted from invertebrates
d. Select “Import trace files...” (Forward and Reverse: 30-964965734)
i. Paste the trace file number, click search, and then select the file
number in the first column.
e. Select a single sequence. If you had an MRK number previously, you can log into
the Sample Database (https://sampledb.dnalc.org/) to find it. Forward and
Reverse (denoted by an F or R at the end) should have matching MRK numbers.
Select one forward and one reverse read, then click on “Add selected files.”
4. Take a minute to observe your Blue Line project page and orient yourself to its key
features. Find the following key features on your screen:
a. The “stops” on the Blue Line
i. What are the 3 main stops on the Blue Line?
___________________________________
ii. How many “substops” are there at each main stop?
_______________________________________
b. The “key”
i. How many substops are “blocked” before you start your project?
________________________________________
c. The “project information”
i. What is your project ID#? You or your instructor can use this number to
search for your project in the database.
______________________________________________
ii. Make your project public so your instructor can view it.
Main Subway Stop #1 - Assembling sequences:
The main goal of this stop is to ensure that your DNA sequences from your specimen are
“good.” So what does it mean for a sequence to be “good”? It means that your sequence:
(#1) is sufficiently long (should be ~700+ base pairs in length) and that
(#2) all of the 700+ A’s, T’s, G’s, and C’s in the section of DNA that you sequenced have been
‘called’ correctly by the sequencer.
There are several ways that we can gain confidence that the A’s, T’s, G’s, and C’s in your
sequence are correctly called and that your sequence is sufficiently long. The first step is
“viewing” the sequence.
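The two "good sequence" criteria above can also be checked with a few lines of code. This is only a minimal sketch, not how DNA Subway works internally: the function name summarize_read and the 5% limit on N's are assumptions made for illustration, while the 700 bp target comes from this handout.

```python
def summarize_read(seq: str, min_length: int = 700, max_n_fraction: float = 0.05) -> dict:
    """Report the length and N content of one read and flag whether it passes both checks.

    min_length follows the ~700 bp guideline in this handout.
    max_n_fraction is a hypothetical cutoff chosen only for this example.
    """
    seq = seq.upper()
    length = len(seq)
    n_count = seq.count("N")  # N = a base the sequencer could not call
    n_fraction = n_count / length if length else 1.0
    return {
        "length": length,
        "n_count": n_count,
        "n_fraction": n_fraction,
        "passes": length >= min_length and n_fraction <= max_n_fraction,
    }


if __name__ == "__main__":
    # A made-up, very short read used only to show the output format.
    toy_read = "NNNACGTTGCANN"
    print(summarize_read(toy_read, min_length=10, max_n_fraction=0.5))
```

A real read would be pasted in as a much longer string. The trace file's purple confidence bars carry more information than a simple count of N's, so treat this as a rough first check only.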
1. Sequence Viewer: Click on the Sequence Viewer. Observe the icon to the left of the
green arrow. Move your mouse over the icon for each of your sequences. You can click
on this icon for each sequence to view the sequences.
The Sequence Viewer sub-stop in the DNA Subway allows you to view your
sequences in a format called a “trace file”. Trace files have several key pieces of
information. In the top row they have the color coded bases (A, T, G, C) that make up
your DNA sequence (see figure below). These are also called “base calls”. You may
see some N’s in your base calls. This means the base at that point in your sequence
could not be determined. In the second row is a series of purple bars. These purple bars
tell you the level of confidence you can have that each base call is correct. If the purple
bars pass the horizontal blue line (the threshold) that runs through them, this means you
can trust the base call. If the purple bars are below the threshold line, then you should
be less confident that the base is called correctly. The numbers below the purple bars (in
intervals of 10) give the position of each base in your sequence. At the very bottom of
the trace file is a series of colored humps (peaks). These represent the fluorescent color
signal read out during the electrophoresis step used to sequence your DNA sample.
2. Observe each trace file and answer the following questions:
a. How many total base pairs long is your forward read? _______________
b. How many total base pairs long is your reverse read? _______________
c. Where in your sequence (5’ end, middle, 3’ end) do you see the highest
concentration(s) of N’s (= bad base calls)? ___________________________
d. Are both of your sequences “good”? _______________________________
e. If so, explain how you know this:
f. If not, which sequence(s) is/are low quality (forward/reverse)? ___________
g. Explain how you know this:
h. If one of your sequences is low quality, you will skip steps 5-7 below. Once you
have completed these steps, close the Sequence Viewer window.
3. Sequence Trimmer: You should have noticed a high concentration of N’s (low
confidence base calls) at the beginning (5’ end) and end (3’ end) of your DNA
sequence(s). This is a typical result of the Sanger DNA sequencing procedure that can
easily be remedied by “trimming” off these bad calls from each end of your sequence.
Trimming off these bad calls helps to “clean up” your DNA sequence, so it is composed
of only accurate base pairs.
4. Click on the ‘Sequence Trimmer’ button to trim each of your sequences. Click on the
Sequence Trimmer button a second time to open the Sequence Trimmer window, and
then click on the icon to left of the green arrows to view the trace files for each of your
sequences. Answer the following questions:
a. How many base pairs long is your trimmed forward read? _______________
b. How many base pairs long is your trimmed reverse read? _______________
c. How many base pairs did you trim off of each read?_________________
5. Pair Builder: Most of you should have 2 good DNA sequences from each of your
samples: a forward read and a reverse read. Recall that each of these reads is from the
same region of DNA in your organism, just from antiparallel strands. Thus they convey
the same information. These two reads essentially act as a way to double check that
your sequence information is correct, as long as you can link them together in the right
way to perform this check. The Pair Builder function in DNA Subway can do this for you,
but first you have to tell it which DNA sequences to pair up.
6. Click on ‘Pair Builder’ to link your forward sequencing read with the corresponding
reverse read. This will help generate a consensus sequence. The consensus
sequence represents the best agreement between the forward and reverse sequencing
reads, which enables you to generate a longer DNA barcoding sequence than a single
forward or reverse read alone.
a. Click on the ‘F’ to the right of the REVERSE sequence in your pair. By clicking on
the F, the entry will change to ‘R,’ indicating the sequence has been converted to
the reverse complement.
b. Check the boxes for the two sequences that you wish to pair and confirm your
selection in the pop-up. If you are unable to check your boxes, skip to step d.
c. After the sequences are paired, save your pairs.
d. If you only have 1 good read then you won’t be able to pair up your data, so just
skip this step as well as step #7 (consensus builder).
7. If you could successfully complete step 6, click on ‘Consensus Builder’ to align the
paired forward and reverse reads.
a. Once the program has finished, click on Consensus Builder again to view the
consensus sequence.
b. Click on [Trim Consensus] - but DON’T trim your consensus sequence. This
window, however, will give you the length of your consensus sequence.
c. While in the ‘Trim Mode,’ take a screenshot of the end portion of your consensus
sequence for your lab notebook. Please make sure your screenshot includes the
length of the consensus sequence. If you could not make a consensus
sequence, then include a screenshot of the 3’ end of your longest individual
sequence and add this to your notebook. This will enable your instructor to
determine the length of your longest read.
i. How many base pairs long is your consensus sequence? ______________________
ii. How many base pairs longer is this than the longest individual (F or R) read? _____________________
iii. Why is this important? In other words, why is it better to have a consensus sequence? _______________
d. Click on [Exit Trim Mode] to return to the Consensus Editor View.
e. Close the Consensus Editor window by clicking on the [X] in the upper right
corner.
Congratulations! Now you have your trimmed, edited, consensus DNA sequence(s). You have
created a DNA barcode!
Stop 2 - Add Sequences
BLAST or Basic Local Alignment Search Tool is an important tool in bioinformatics. BLAST
uses an algorithm for comparing sequence information, such as the amino-acid sequences of
different proteins or the nucleotides of DNA sequences. A BLAST search (or query) is
essentially the same as a Google search, but instead of searching for websites it searches for DNA
sequences. Just like Google, a BLAST search can quickly identify any close matches (or hits)
to your DNA sequence by comparing it to sequences maintained in a national database that
contains millions of sequences collected from organisms all over the world. A BLAST search
can potentially use your sequence to identify your organism to the genus or species level.
However, if your exact DNA sequence is not in the database, then the BLAST search will list the
most closely related matches.
Follow the instructions below to conduct a BLAST query for your DNA sequence(s).
1. On the Add Sequences line of the home page, click on BLASTN. Then click on the
BLAST button next to the sequence that you want to query against the database. This
could be your high quality consensus sequence, or it might be just one of your
sequencing reads. It will take 20-30 seconds for your BLAST results to be returned.
Once the query is finished, the results will pop up automatically. You should see
something like this:
2. Take a screenshot of your top 7-10 BLAST hits for your Lab notebook.
3. The 20 or more most significant alignments (or BLAST hits) will be listed in a table for
you.
a. The second column in this table has an Accession Number, which is a unique
identifier given to each sequence submitted to the database.
b. The Details column includes taxonomic information on the hit and a description
of the sequence information that was used, such as the name of the gene that
was sequenced (Ex. cytochrome oxidase = COI).
c. Click on the taxonomic name of each hit for a link to an image of the organism
and additional links.
i. Use this taxonomic information to answer question 2 in your post
lab.
d. The last few columns show statistics that allow you to determine the “quality” of
each hit to your sequence query.
i. The first of these columns tells you your alignment length. In other words, this
tells you how many base pairs of your sequence were used to make a
match. Typically the longer the alignment length, the better. This is
because a greater number of base pairs (more information) is being used
to make the match between your query and your hit(s).
1. Fill in this information to answer question 2 in your post lab.
ii. The next column has a Bit Score. The bit score is calculated using a
formula that takes into account the length of the query sequence used
and the number of mismatches between your query sequence and each
hit from your BLAST search. The higher the bit score, the better the
alignment.
1. Fill in this information to answer question 2 in your post lab.
iii. The Expectation or E value is the number of matches at least this good that you
would expect to find purely by chance in a database of this size. The
lower the E value, the less likely it is that the hit is a chance match (and the
more likely it is that the hit is truly related to your sequence). For example, an E
value of 0 means that the expected number of chance matches is so small that
it has been rounded down to zero.
1. Fill in this information to answer question 2 in your post lab.
iv. In the last column, the number of Mismatches tells you how many base
pairs don’t match up between your query sequence and each hit from
BLAST. This gives a rough idea of how closely the two sequences match
and therefore how closely related they are likely to be. Typically, more
mismatches = more sequence differences = greater evolutionary
distance = less closely related.
1. Fill in this information to answer question 2 in your post lab.
4. Add BLAST sequence data to your phylogenetic analysis by checking the boxes next to
the accession numbers of your top 10 taxonomically unique (if possible) hits (rated by bit
score). Avoid repeating the same species if possible, and avoid entries like ‘Uncultured
Sp,’ ‘Uncultured Fungi,’ ‘Fungi sp.,’ and ‘Basidiomycota sp.’ Once you select your
BLAST hits, you cannot go back and choose different ones, so please ask your
instructor if you are unsure. Once you have selected your BLAST hits, click on ‘Add
BLAST hits to project.’
5. Skip the Upload Data stop. Next, click on ‘Reference Data’ to select data that will let
you compare your barcode sequence to barcodes of other, more common species that
are closely related to your organism. This will be helpful for creating an obvious outgroup
for your phylogenetic analyses. It will also help to place your organism into a larger
taxonomic context. Use the table below to select the appropriate reference group for
your specimen.
If your sample is a(n):           Reference Data Set to add:
Fish                              Common Fish
Shark                             Sharks
Fungus                            Fungi
Plant                             Common Plants
Insect or other invertebrate      Common Invertebrates
6. Click on ‘Add ref data’ to add those sequences to your project. Close out the popup
window.
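The Bit Score and E value columns described in step 3 above are linked by a standard relationship from BLAST statistics: E = m x n x 2^(-S), where S is the bit score, m is the query length, and n is the total length of the database. The sketch below only illustrates that relationship. The function name is hypothetical and the database size is a made-up round number. BLAST itself uses corrected ("effective") lengths, so the values it reports will differ.

```python
def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E value: E = m * n * 2 ** (-S).

    m = query length, n = total database length, S = bit score.
    Uses raw lengths for simplicity (an assumption of this example).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    query_length = 650                # a typical barcode length in base pairs
    database_length = 10**11          # hypothetical database size, for illustration
    for bit_score in (40, 100, 400):
        e_value = expected_chance_hits(bit_score, query_length, database_length)
        print(f"bit score {bit_score}: E is about {e_value:.3g}")
```

Each additional bit of score halves the E value, which is why high-scoring hits are often displayed with an E value of 0.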
Stop 3 - Analyze Sequences:
In many cases, an unknown species can be identified by a BLAST search. However, it can be
very useful to create visual representations of the results of your BLAST search. These visual
representations can help you gain a deeper understanding of your results and aid in their
analysis. At DNA Subway stop 3, we will create two different kinds of visual representations of
our data: A sequence alignment and phylogenetic trees. These visual representations will be
very helpful in analyzing and interpreting our data. They can also add depth to our analyses by
showing how closely your sequence is related to other organisms from your BLAST query
based on differences in their DNA barcodes. Follow the procedures below to create a sequence
alignment and some phylogenetic trees.
1. Under ‘Analyze Sequences,’ click on ‘Select Data.’
2. Select ‘User Data’, ‘Blast hits’ and one species from your reference data to serve as
an outgroup. Then click on ‘Save Selections.’
3. Next, click on MUSCLE to align the sequences. MUSCLE stands for Multiple Sequence
Comparison by Log Expectation. Basically what this program does is line up all the DNA
sequences you’ve selected in the previous step into a nicely organized column. This is
called an “alignment”. Once the program has finished, click on MUSCLE again to view
the alignment. This alignment allows you to more easily visualize the similarities and
differences between your selected sequences (see figure below).
Each row in this MUSCLE alignment represents the DNA sequence of an individual “hit” from
your BLAST search. The labels to the left of the alignment identify each sample. The numbers
above the alignment (at intervals of 100) represent the base pair position in the sequence. The
grey areas in each column represent base pairs that match in each sequence. The colored
areas represent polymorphisms, or locations where the base pairs are different across
sequences. Each color is representative of a specific nucleotide (A=green, T=red, C=blue,
G=black).
4. Click on the + to zoom in, the - to zoom out, and ATCG to switch to nucleotides instead
of colored bars.
a. Scroll through your alignments to view similarities among sequences.
Nucleotides are color-coded and each row of nucleotides is the barcode
sequence of a single organism. Columns are matches (or mismatches) at a
single nucleotide position across all sequences. Dashes (-) are gaps in the
sequence, where nucleotides in one sequence are not represented in other
sequences.
i. What is the length (in base pairs) of your alignment? __________bp
ii. Which sequence looks like it is the least similar (i.e., has the most
base pair differences) to your DNA barcode sequence?
iii. Which sequence(s) looks like it is the most similar to your DNA
barcode sequence?
b. You may notice that some sequences are longer than others. These will have
grey areas at each end that extend beyond the edges of your DNA barcode
sequence. In order to make fair comparisons between our sequences, we need
to trim them all down to an equal size. Click on the ‘Trim Alignment’ button to trim
off any unaligned ends.
i. Unaligned ends are scored as mismatches by tree-building algorithms, so
it is very important that they are removed.
ii. Take a screenshot of your Trimmed MUSCLE alignment for your lab
notebook.
c. Click on the ‘Sequence Similarity %’ button in the upper right corner of your
MUSCLE alignment. This will open up a popup window that displays the %
similarity between all of your samples. For example, a value of 99% means that
only 1% of the base pairs are different between the two sequences being
compared. Thus the higher the sequence similarity value, the more closely
related these two sequences are, and presumably these two organisms.
i. What is the percent similarity between your sequence and the
closest match from your BLAST search? ___%
ii. Name of closest match:_________
iii. What is the percent similarity between your sequence and the least
similar match? _________%
iv. Name of worst (least similar) match:__________________ This will
serve as your OUTGROUP.
d. Close the Sequence Similarity window and the MUSCLE alignment window to
return to your project home page
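The ‘Trim Alignment’ and ‘Sequence Similarity %’ steps above can be sketched in code. This is a simplified illustration with hypothetical function names. DNA Subway's exact rules for trimming and for counting gaps are not described in this handout, so the choices below (keep the region where every sequence has a base; count a base opposite a gap as a difference) are assumptions.

```python
def trim_alignment(aligned: list[str]) -> list[str]:
    """Cut every aligned sequence down to the shared region.

    Keeps the columns from the first to the last position at which
    every sequence has a base (no '-' gap in that column).
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    full_columns = [i for i in range(length) if all(seq[i] != "-" for seq in aligned)]
    if not full_columns:
        return ["" for _ in aligned]
    start, end = full_columns[0], full_columns[-1] + 1
    return [seq[start:end] for seq in aligned]


def percent_similarity(a: str, b: str) -> float:
    """Percent of compared positions at which two aligned sequences match.

    Positions where both sequences have a gap are skipped. A base opposite
    a gap counts as a difference (an assumption of this example).
    """
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Made-up alignment: your barcode plus two hits with ragged ends.
    alignment = ["--ACGTAC--", "TTACGTACGG", "--ACCTAC-G"]
    trimmed = trim_alignment(alignment)
    print(trimmed)
    print(round(percent_similarity(trimmed[0], trimmed[2]), 1))
```

Running the similarity calculation before and after trimming shows why the ragged ends matter: without trimming, the unaligned ends would be counted as differences and make related sequences look less similar than they are.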
Visualizing your data using phylogenetic trees:
Phylogenetic trees are another helpful way to visualize the relationships between your
sequence and your BLAST hits. Follow the directions below to generate two different types of
phylogenetic trees: Neighbor joining and Maximum likelihood.
5. Click on ‘PHYLIP NJ’ to generate a phylogenetic tree using the neighbor-joining (NJ)
method.
a. Click on PHYLIP NJ again to open a popup window with your neighbor-joining
tree. It should look something like this:
b. Confirm that the proper outgroup has been selected for your NJ tree using the
pulldown menu above the tree. The outgroup should be labeled by red text.
c. Look at the scientific names of the five most closely related organisms on the
tree. Are they in the same species, genus, family, order, or class? The most specific
taxonomic level they all have in common is the level to which you can feel confident
in identifying your organism. Record this information in your postlab.
i. What is the most specific taxonomic level that your NJ tree enabled
you to identify your DNA barcode sequence to? (Ex. Species?
Genus? Family? Order?)__________________
ii. What is the taxonomic information for the most closely related
match to your DNA barcode?________________
d. Take a screenshot of your neighbor-joining tree for your lab notebook. Then
close the window with your NJ tree and return to the project home page.
6. Click on ‘PHYLIP ML’ to generate a phylogenetic tree using the maximum-likelihood
(ML) method.
a. Click on PHYLIP ML again to open the tree in a new window. The branch tips
are the DNA sequences of individual taxa that you analyzed. Two or more
branches are connected to each other by a node, which represents the common
ancestor of those taxa (see figure below).
b. The length of each horizontal branch is a measure of the evolutionary distance
from the ancestral sequence at the node. Taxa with short horizontal branches
from a node are closely related. Those with longer horizontal branches are more
distantly related. For example, in the tree above, Ginkgo is the least related to
the other taxa.
c. At the top of the tree is a pulldown menu that allows the outgroup to be changed.
Make sure to change the outgroup to the least similar taxon from your MUSCLE
alignment.
d. Find your DNA barcode sequence (unknown in this tree) and evaluate your
sequence’s position in the tree.
i. If your sequence is closely related to any of the reference or uploaded
sequences, it will share a single node with those species. If your
sequence is identical (or close to identical) to another sequence, the two
will emerge directly from the vertical line of the node without any
horizontal branching.
ii. If your sequence is distantly related to all of the species in your tree, your
sequence will sit on a branch by itself, with the other sequences grouping
together as a clade.
iii. Look at the scientific names of sequences within the most closely
associated clade. If all members share the same genus name, and no
members of that genus are outside of that clade, you have identified your
specimen as belonging to that genus. If members of that genus are
represented outside that clade, check whether they belong to the same
family or order. If all members of the clade are from the same family, you
can identify your sample to the family level, but not to a specific genus. If
all members of the clade belong to the same order, you can identify your
specimen to the order level, but not to a specific family. Record this
information in your postlab.
1. What is the most specific taxonomic level that your ML tree
enabled you to identify your DNA barcode sequence to? (Ex.
Species? Genus? Family? Order?)__________________
2. What is the taxonomic information for the most closely
related match to your DNA barcode?________________
e. Take a screenshot of your maximum likelihood tree for your lab notebook.
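The rule in step 6.d.iii for deciding how specifically you can identify your specimen can be written out as a short procedure. The sketch below is one possible reading of that rule, not part of DNA Subway. The function name and the dictionary layout are hypothetical, and the taxa are example data only. As in the handout, the "no members outside the clade" check is applied at the genus level.

```python
# Ranks are checked from most specific to least specific.
RANKS = ("genus", "family", "order")


def most_specific_shared_rank(clade: list[dict], outside: list[dict]):
    """Return (rank, name) for the most specific rank shared by the whole clade.

    clade   = taxa grouped most closely with your sequence
    outside = the other taxa in the tree
    A shared genus only counts if no member of that genus sits outside the clade.
    Returns None if the clade members do not even share an order.
    """
    for rank in RANKS:
        names = {member[rank] for member in clade}
        if len(names) != 1:
            continue  # members disagree at this rank, so try a broader one
        shared = names.pop()
        if rank == "genus" and any(taxon[rank] == shared for taxon in outside):
            continue  # genus also appears elsewhere in the tree
        return rank, shared
    return None


if __name__ == "__main__":
    # Example data only: two ant hits in the clade, a bee elsewhere in the tree.
    clade = [
        {"genus": "Formica", "family": "Formicidae", "order": "Hymenoptera"},
        {"genus": "Formica", "family": "Formicidae", "order": "Hymenoptera"},
    ]
    outside = [{"genus": "Apis", "family": "Apidae", "order": "Hymenoptera"}]
    print(most_specific_shared_rank(clade, outside))
```

Working through your own tree by hand follows the same order of questions: same genus first, then family, then order.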
Organism Natural History
Do a Google image or Wikipedia search for the 3 best matches from your BLAST query. Copy
and paste the images into your Lab Notebook entry for the week (cite your sources - also copy
and paste the website link). Below the image, note the geographic range for the organism.
Which of these organism(s) looks most similar to your specimen? ________________________
Does the geographic range of the most similar species make sense if this is your species? ____
Organism Classification
Based on the results from your BLAST search, MUSCLE alignment, phylogenetic trees, image
search, and geographic range search, what is the most specific taxonomic level that you are
confident that your specimen falls under?
Kingdom
Phylum
Class
Order
Family
Genus
Species
BEFORE LEAVING - please write a lab notebook entry for your DNA Subway project. Each
individual should download and complete ONLY your own DNA Subway information (you don’t
need to include the information for previous labs!) for the lab notebook entry.
Items to include in your Lab Notebook
1. Title: A general title and the date
2. DNA Subway Blue Line Project Info: Your DNA Subway name and Blue Line project title
3. Consensus sequence: A screenshot of your consensus sequence (or 3' end of longest
individual read). Be sure to label each image with the length of the sequence.
4. BLAST hits: A screenshot of your top 7-10 BLAST hits. This should include all columns
(alignment length, bit score, e-value, and mismatches) for each hit.
5. MUSCLE alignment and % similarity: A screenshot of your trimmed MUSCLE alignment
and a written statement explaining the % similarity between your sample and its closest
match.
6. NJ tree: A screenshot of your NJ tree with the properly selected outgroup and YOUR
SAMPLE HIGHLIGHTED. Be sure to state which taxon is most closely related to yours.
7. ML tree: A screenshot of your ML tree with the properly selected outgroup and YOUR
SAMPLE HIGHLIGHTED. Be sure to state which taxon is most closely related to yours.
8. Google/Wikipedia images of top 3 BLAST hits: You can do fewer than three if there aren't 3
obviously different hits.
9. Organism ID: What is the most accurate taxonomic identity of your organism?