Week 3: Analysis of Sequencing Data on the DNA Subway

Goals:
● Use bioinformatic tools in the DNA Subway Blue Line to:
  ○ Visualize and edit your DNA sequences to create a DNA barcode
  ○ Search a national database of DNA sequences for matches to your DNA barcode
  ○ Create phylogenetic trees to help determine the taxonomic identity of your specimen

Today we will use a website called DNA Subway to analyze our DNA sequences and attempt to identify the specimen(s) we collected. Recall the steps that have led us up to this point:
● We found organisms, photographed them, and collected their tissue
● We extracted DNA from those tissue samples
● We amplified specific regions of that DNA extract using specific DNA barcoding primers and the Polymerase Chain Reaction (PCR)
● We used gel electrophoresis to determine whether or not our PCR reactions were successful
● We shipped off successfully amplified DNA samples to be sequenced (a forward and reverse read for each sample)

Now we need to determine:
● Whether or not the DNA sequences are “accurate”
● Whether or not the DNA sequences are similar to those of other organisms
● Just how similar the DNA sequences are to other organisms’ DNA sequences
● Whether or not we can use the DNA sequences to help determine your specimen’s taxonomic identity

We could do this by hand, by printing out our DNA sequences, lining them up with all the other DNA sequences that are out there, and trying to visualize which DNA sequences are most similar to each other. However, this is extremely difficult when your DNA sequences are several hundred base pairs long and there are literally millions of other DNA sequences for comparison. So how do we make sense of all these complex data? Scientists in an emerging field of science called bioinformatics have developed many of the tools we need to conduct these analyses.
Bioinformatics is an interdisciplinary field of science that combines computer science, statistics, mathematics, and engineering to analyze and interpret complex biological data, like DNA sequences. Not surprisingly, bioinformatic techniques are often complex and very difficult for novices to use. Fortunately for us, a website called DNA Subway (https://dnasubway.cyverse.org/) has conveniently combined a number of complex bioinformatic tools into a simple, user-friendly way to analyze DNA barcode sequences. Today we will learn how to use the “Blue Line” of the DNA Subway to analyze our DNA sequences and in the process become novice bioinformaticians.

Overview of Today’s Activities:
1) Your instructor will navigate you step-by-step through the DNA Subway using a presentation
2) After the practice round, you will work individually to analyze your own DNA sequences using DNA Subway
3) You will record the results of your analysis in your lab notebook
4) You will use the results of your analysis to complete a Postlab worksheet

Procedures for Activity 2: Analyzing your DNA sequence(s)
1. Google “DNA Subway.” Navigate to the DNA Subway website. Log in.
2. Once you are logged in, click the blue square “Determining Sequence Relationships” on the DNA Subway homepage to create a new Blue Line project.
3. Create your project
   a. Select the proper barcoding project type for your specimen (ants are invertebrates)
      i. rbcL = plant; COI = invertebrate, fish, or other animal; 16S = bacteria; ITS = fungi
   b. Create a project title
      i. Last name_Specimen type (plant, fungi, fish, ant, etc.)_BIO102_Sec#
         1. Example: Smith_plant_BIO102_Sec10
   c. Provide a project description
      i. Include info that will help you remember what this project was about
         1. Ex. DNA barcodes extracted from invertebrates
   d. Select “Import trace files...” (Forward and Reverse: 30-964965734)
      i. Paste the trace file number, click search, and then select the file number in the first column.
   e. Select a single sequence.
If you had an MRK number previously, you can log into the Sample Database (https://sampledb.dnalc.org/) to find it. Forward and Reverse (denoted by an F or R at the end) should have matching MRK numbers. Select one forward and one reverse. Click on “Add selected files.”
4. Take a minute to observe your Blue Line project page and orient yourself to its key features. Find the following key features on your screen:
   a. The “stops” on the Blue Line
      i. What are the 3 main stops on the Blue Line? ___________________________________
      ii. How many “substops” are there at each main stop? _______________________________________
   b. The “key”
      i. How many substops are “blocked” before you start your project? ________________________________________
   c. The “project information”
      i. What is your project ID#? You or your instructor can use this number to search for your project in the database. ______________________________________________
      ii. Make your project public so your instructor can view it

Main Subway Stop #1 - Assembling sequences:
The main goal of this stop is to ensure that your DNA sequences from your specimen are “good.” So what does it mean for a sequence to be “good”? It means that your sequence: (#1) is sufficiently long (should be ~700+ base pairs in length) and that (#2) all of the 700+ A’s, T’s, G’s, and C’s in the section of DNA that you sequenced have been ‘called’ correctly by the sequencer. There are several ways that we can gain confidence that the A’s, T’s, G’s, and C’s in your sequence are correctly called and that your sequence is sufficiently long. The first step is “viewing” the sequence.
1. Sequence Viewer: Click on the Sequence Viewer. Observe the icon to the left of the green arrow. Move your mouse over the icon for each of your sequences. You can click on this icon for each sequence to view the sequences. The Sequence Viewer sub-stop in the DNA Subway allows you to view your sequences in a format called a “trace file”.
Trace files have several key pieces of information. In the top row they have the color-coded bases (A, T, G, C) that make up your DNA sequence (see figure below). These are also called “base calls”. You may see some N’s in your base calls. This means the base at that point in your sequence could not be determined. In the second row is a series of purple bars. These purple bars tell you the level of confidence you can have that each base call is correct. If a purple bar rises above the horizontal blue line (threshold) that runs through the bars, you can trust the base call. If a purple bar is below the threshold line, you can be less confident that the base is called correctly. The numbers below the purple bars (in intervals of 10) tell you the position, in base pairs, along your sequence. At the very bottom of the trace file is a series of colored humps. These represent the color readout from the gel electrophoresis used to sequence your DNA sample.
2. Observe each trace file and answer the following questions:
   a. How many total base pairs long is your forward read? _______________
   b. How many total base pairs long is your reverse read? _______________
   c. Where in your sequence (5’ end, middle, 3’ end) do you see the highest concentration(s) of N’s (= bad base calls)? ___________________________
   d. Are both of your sequences “good”? _______________________________
   e. If so, explain how you know this:
   f. If not, which sequence(s) is/are low quality (forward/reverse)? ___________
   g. Explain how you know this:
   h. If one of your sequences is low quality, you will skip steps 5-7 below.
   Once you have completed these steps, close the Sequence Viewer window.
3. Sequence Trimmer: You should have noticed a high concentration of N’s (low confidence base calls) at the beginning (5’ end) and end (3’ end) of your DNA sequence(s).
This is a typical result of the Sanger DNA sequencing procedure that can easily be remedied by “trimming” off these bad calls from each end of your sequence. Trimming off these bad calls helps to “clean up” your DNA sequence, so it is composed of only accurate base pairs.
4. Click on the ‘Sequence Trimmer’ button to trim each of your sequences. Click on the Sequence Trimmer button a second time to open the Sequence Trimmer window, and then click on the icon to the left of the green arrows to view the trace files for each of your sequences. Answer the following questions:
   a. How many base pairs long is your trimmed forward read? _______________
   b. How many base pairs long is your trimmed reverse read? _______________
   c. How many base pairs did you trim off of each read? _________________
5. Pair Builder: Most of you should have 2 good DNA sequences from each of your samples: a forward read and a reverse read. Recall that each of these reads is from the same region of DNA in your organism, just from antiparallel strands. Thus they convey the same information. These two reads essentially act as a way to double check that your sequence information is correct, as long as you can link them together in the right way to perform this check. The Pair Builder function in DNA Subway can do this for you, but first you have to tell it which DNA sequences to pair up.
6. Click on ‘Pair Builder’ to link your forward sequencing read with the corresponding reverse read. This will help generate a consensus sequence. The consensus sequence represents the best agreement between the forward and reverse sequencing reads, which enables you to generate a longer DNA barcoding sequence than a single forward or reverse read alone.
   a. Click on the ‘F’ to the right of the REVERSE sequence in your pair. When you click on the F, the entry will change to ‘R,’ indicating the sequence has been converted to the reverse complement.
   b. Check the boxes for the two sequences that you wish to pair and confirm your selection in the pop-up. If you are unable to check your boxes, skip to step d.
   c. After the sequences are paired, save your pairs.
   d. If you only have 1 good read then you won’t be able to pair up your data, so just skip this step as well as step #7 (Consensus Builder).
7. If you could successfully complete step 6, click on ‘Consensus Builder’ to align the paired forward and reverse reads.
   a. Once the program has finished, click on Consensus Builder again to view the consensus sequence.
   b. Click on [Trim Consensus] - but DON’T trim your consensus sequence. This window, however, will give you the length of your consensus sequence.
   c. While in the ‘Trim Mode,’ take a screenshot of the end portion of your consensus sequence for your lab notebook. Please make sure your screenshot includes the length of the consensus sequence. If you could not make a consensus sequence, then include a screenshot of the 3’ end of your longest individual sequence and add this to your notebook. This will enable your instructor to determine the length of your longest read.
      i. How many base pairs long is your consensus sequence? ______________________
      ii. How many base pairs longer is this than the longest individual (F or R) read? _____________________
      iii. Why is this important? Aka, why is it better to have a consensus sequence? _______________
   d. Click on [Exit Trim Mode] to return to the Consensus Editor View.
   e. Close the Consensus Editor window by clicking on the [X] in the upper right corner.

Congratulations! Now you have your trimmed, edited, consensus DNA sequence(s). You have created a DNA barcode!

Stop 2 - Add Sequences
BLAST, or Basic Local Alignment Search Tool, is an important tool in bioinformatics. BLAST uses an algorithm for comparing sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.
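To get a feel for what “comparing sequence information” involves, the short Python sketch below slides a query along a longer sequence and reports the placement with the most matching bases. This is a toy illustration only, not the BLAST algorithm (BLAST also handles gaps and calculates the statistics described later in this handout). The function name and both sequences are made up for this example.

```python
def best_ungapped_match(query: str, subject: str) -> tuple[int, int]:
    """Slide `query` along `subject` with no gaps allowed.

    Returns (offset, matches) for the placement with the most matching
    bases. Hypothetical helper for illustration; not part of DNA Subway
    or BLAST.
    """
    if len(query) > len(subject):
        raise ValueError("query must not be longer than subject")

    best_offset, best_matches = 0, -1
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        # Count the positions where the two sequences carry the same base.
        matches = sum(q == s for q, s in zip(query, window))
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


# Made-up sequences: the query matches best starting at position 2,
# with 6 matching bases and 1 mismatch.
offset, matches = best_ungapped_match("GATTACA", "CCGATTGCATT")
print(offset, matches, len("GATTACA") - matches)
```

Even this simple approach has to check every possible placement, which hints at why searching millions of database sequences requires a specialized tool.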
A BLAST search (or query) is essentially the same as a Google search, but instead of searching for websites it searches for DNA sequences. Just like Google, a BLAST search can quickly identify any close matches (or hits) to your DNA sequence by comparing it to sequences maintained in a national database that contains millions of sequences collected from organisms all over the world. A BLAST search can potentially use your sequence to identify your organism to the genus or species level. However, if your exact DNA sequence is not in the database, then the BLAST search will list the most closely related matches. Follow the instructions below to conduct a BLAST query for your DNA sequence(s).
1. On the Add Sequences line of the home page, click on BLASTN. Then click on the BLAST button next to the sequence that you want to query against the database. This could be your high quality consensus sequence, or it might be just one of your sequencing reads. It will take 20-30 seconds for your BLAST results to be returned. Once the query is finished, the results will pop up automatically. You should see something like this:
2. Take a screenshot of your top 7-10 BLAST hits for your Lab notebook.
3. The 20 or more most significant alignments (or BLAST hits) will be listed in a table for you.
   a. The second column in this table has an Accession Number, which is a unique identifier given to each sequence submitted to the database.
   b. The Details column includes taxonomic information on the hit and a description of the sequence information that was used, such as the name of the gene that was sequenced (Ex. cytochrome oxidase = COI).
   c. Click on the taxonomic name of each hit for a link to an image of the organism and additional links.
      i. Use this taxonomic information to answer question 2 in your post lab.
   d. The last few columns show statistics that allow you to determine the “quality” of each hit to your sequence query.
      i. The first of these columns tells you your alignment length.
In other words, this tells you how many base pairs of your sequence were used to make a match. Typically the longer the alignment length, the better. This is because a greater number of base pairs (more information) is being used to make the match between your query and your hit(s).
         1. Fill in this information to answer question 2 in your post lab.
      ii. The next column has a Bit Score. The bit score is calculated using a formula that takes into account the length of the query sequence used and the number of mismatches between your query sequence and each hit from your BLAST search. The higher the bit score, the better the alignment.
         1. Fill in this information to answer question 2 in your post lab.
      iii. The Expectation or E value is the number of matches at least this good that BLAST would expect to find purely by chance in a database of this size. The lower the E value, the more likely it is that the hit is truly related to your sequence (and not just matching up by chance). For example, an E value of 0 means that essentially no matches this good are expected by chance.
         1. Fill in this information to answer question 2 in your post lab.
      iv. In the last column, the number of Mismatches tells you how many base pairs don’t match up between your query sequence and each hit from BLAST. This gives a rough idea of how closely the two sequences match and therefore how closely related they are likely to be. Typically, more mismatches = more sequence differences = greater evolutionary distance = less closely related.
         1. Fill in this information to answer question 2 in your post lab.
4. Add BLAST sequence data to your phylogenetic analysis by checking the boxes next to the accession numbers of your top 10 taxonomically unique (if possible) hits (rated by bit score). Avoid repeating the same species if possible, and avoid entries like ‘Uncultured Sp,’ ‘Uncultured Fungi,’ ‘Fungi sp.,’ and ‘Basidiomycota sp.’ etc.
Once you select your BLAST hits, you cannot go back and choose different ones. So please ask your instructor if you are unsure. Once you have selected your BLAST hits, click on ‘Add BLAST hits to project.’
5. Skip the Upload Data stop. Next, click on ‘Reference Data’ to select data that will let you compare your barcode sequence to barcodes of other, more common species that are closely related to your organism. This will be helpful for creating an obvious outgroup for your phylogenetic analyses. It will also help to place your organism into a larger taxonomic context. Use the table below to select the appropriate reference group for your specimen.

   If your sample is a(n):         Reference Data Set to add
   Fish                            Common Fish
   Shark                           Sharks
   Fungus                          Fungi
   Plant                           Common Plants
   Insect or other invertebrate    Common Invertebrates

6. Click on ‘Add ref data’ to add those sequences to your project. Close out the popup window.

Stop 3 - Analyze Sequences:
In many cases, an unknown species can be identified by a BLAST search. However, it can be very useful to create visual representations of the results of your BLAST search. These visual representations can help you gain a deeper understanding of your results and aid in their analysis. At DNA Subway stop 3, we will create two different kinds of visual representations of our data: a sequence alignment and phylogenetic trees. These visual representations will be very helpful in analyzing and interpreting our data. They can also add depth to our analyses by showing how closely your sequence is related to other organisms from your BLAST query based on differences in their DNA barcodes. Follow the procedures below to create a sequence alignment and some phylogenetic trees.
1. Under ‘Analyze Sequences,’ click on ‘Select Data.’
2. Select ‘User Data’, ‘Blast hits’ and one species from your reference data to serve as an outgroup. Then click on ‘Save Selections.’
3. Next, click on MUSCLE to align the sequences.
MUSCLE stands for Multiple Sequence Comparison by Log Expectation. Basically, this program lines up all the DNA sequences you selected in the previous step so that corresponding positions fall in the same column. This is called an “alignment”. Once the program has finished, click on MUSCLE again to view the alignment. This alignment allows you to more easily visualize the similarities and differences between your selected sequences (see figure below).

Each row in this MUSCLE alignment represents the DNA sequence of an individual “hit” from your BLAST search. The labels to the left of the alignment identify each sample. The numbers above the alignment (at intervals of 100) represent the base pair position in the sequence. The grey areas in each column represent base pairs that match in each sequence. The colored areas represent polymorphisms, or locations where the base pairs are different across sequences. Each color is representative of a specific nucleotide (A=green, T=red, C=blue, G=black).
4. Click on the + to zoom in, the - to zoom out, and ATCG to switch to nucleotides instead of colored bars.
   a. Scroll through your alignments to view similarities among sequences. Nucleotides are color-coded and each row of nucleotides is the barcode sequence of a single organism. Columns are matches (or mismatches) at a single nucleotide position across all sequences. Dashes (-) are gaps in the sequence, where nucleotides in one sequence are not represented in other sequences.
      i. What is the length (in base pairs) of your alignment? __________bp
      ii. Which sequence looks like it is the least similar (aka has the most base pair differences) to your DNA barcode sequence?
      iii. Which sequence(s) looks like it is the most similar to your DNA barcode sequence?
   b. You may notice that some sequences are longer than others. These will have grey areas at each end that extend beyond the edges of your DNA barcode sequence.
In order to make fair comparisons between our sequences, we need to trim them all down to an equal size. Click on the ‘Trim Alignment’ button to trim off any unaligned ends.
      i. Unaligned ends are scored as mismatches by tree-building algorithms, so it is very important that they are removed.
      ii. Take a screenshot of your Trimmed MUSCLE alignment for your lab notebook.
   c. Click on the ‘Sequence Similarity %’ button in the upper right corner of your MUSCLE alignment. This will open up a popup window that displays the % similarity between all of your samples. For example, a value of 99% means that only 1% of the base pairs are different between the two sequences being compared. Thus the higher the sequence similarity value, the more closely related these two sequences are, and presumably these two organisms.
      i. What is the percent similarity between your sequence and the closest match from your BLAST search? ___%
      ii. Name of closest match: _________
      iii. What is the percent similarity between your sequence and the least similar match? _________%
      iv. Name of worst (least similar) match: __________________ This will serve as your OUTGROUP.
   d. Close the Sequence Similarity window and the MUSCLE alignment window to return to your project home page.

Visualizing your data using phylogenetic trees:
Phylogenetic trees are another helpful way to visualize the relationships between your sequence and your BLAST hits. Follow the directions below to generate two different types of phylogenetic trees: neighbor joining and maximum likelihood.
5. Click on ‘PHYLIP NJ’ to generate a phylogenetic tree using the neighbor-joining (NJ) method.
   a. Click on PHYLIP NJ again to open a popup window with your neighbor-joining tree. It should look something like this:
   b. Confirm that the proper outgroup has been selected for your NJ tree using the pulldown menu above the tree. The outgroup should be labeled in red text.
   c. Look at the scientific names of the five most closely related organisms on the tree. Are they in the same species, genus, family, order, or class? Whichever taxonomic level they have in common is the level to which you can feel confident in identifying your organism. Record this information in your postlab.
      i. What is the most specific taxonomic level that your NJ tree enabled you to identify your DNA barcode sequence to? (Ex. Species? Genus? Family? Order?) __________________
      ii. What is the taxonomic information for the most closely related match to your DNA barcode? ________________
   d. Take a screenshot of your neighbor-joining tree for your lab notebook. Then close the window with your NJ tree and return to the project home page.
6. Click on ‘PHYLIP ML’ to generate a phylogenetic tree using the maximum-likelihood (ML) method.
   a. Click on PHYLIP ML again to open the tree in a new window. The branch tips are the DNA sequences of individual taxa that you analyzed. Two or more branches are connected to each other by a node, which represents the common ancestor of those taxa (see figure below).
   b. The length of each horizontal branch is a measure of the evolutionary distance from the ancestral sequence at the node. Taxa with short horizontal branches from a node are closely related. Those with longer horizontal branches are more distantly related. For example, in the tree above, Ginkgo is the least related to the other taxa.
   c. At the top of the tree is a pulldown menu that allows the outgroup to be changed. Make sure to change the outgroup to the least similar taxon from your MUSCLE alignment.
   d. Find your DNA barcode sequence (unknown in this tree) and evaluate your sequence’s position in the tree.
      i. If your sequence is closely related to any of the reference or uploaded sequences, it will share a single node with those species.
If your sequence is identical (or close to identical) to another sequence, the two will emerge directly from the vertical line of the node without any horizontal branching.
      ii. If your sequence is distantly related to all of the species in your tree, your sequence will sit on a branch by itself - with the other sequences grouping together as a clade.
      iii. Look at the scientific names of sequences within the most closely associated clade. If all members share the same genus name, and no members of that genus are outside of that clade, you have identified your specimen as belonging to that genus. If members of that genus are represented outside that clade, check whether they belong to the same family or order. If all members of the clade are from the same family, you can identify your sample to the family level, but not to a specific genus. If all members of the clade belong to the same order, you can identify your specimen to the order level, but not to a specific family. Record this information in your postlab.
         1. What is the most specific taxonomic level that your ML tree enabled you to identify your DNA barcode sequence to? (Ex. Species? Genus? Family? Order?) __________________
         2. What is the taxonomic information for the most closely related match to your DNA barcode? ________________
   e. Take a screenshot of your maximum likelihood tree for your lab notebook.

Organism Natural History
Do a Google Images or Wikipedia search for the 3 best matches from your BLAST query. Copy and paste the images into your Lab Notebook entry for the week (cite your sources - also copy and paste the website link). Below each image, note the geographic range for the organism.
Which of these organism(s) looks most similar to your specimen? ________________________
Does the geographic range of the most similar species make sense if this is your species?
____

Organism Classification
Based on the results from your BLAST search, MUSCLE alignment, phylogenetic trees, image search, and geographic range search, what is the most specific taxonomic level that you are confident that your specimen falls under?
Kingdom   Phylum   Class   Order   Family   Genus   Species

BEFORE LEAVING - please write a lab notebook entry for your DNA Subway project. Each individual should download and complete ONLY their own DNA Subway information (you don’t need to include the information for previous labs!) for the lab notebook entry.

Items to include in your Lab Notebook
1. Title: A general title and the date
2. DNA Subway Blue Line Project Info: Your DNA Subway name and Blue Line project title
3. Consensus sequence: A screenshot of your consensus sequence (or 3' end of longest individual read). Be sure to label each image with the length of the sequence.
4. BLAST hits: A screenshot of your top 7-10 BLAST hits. This should include all columns (alignment length, bit score, e-value, and mismatches) for each hit.
5. MUSCLE alignment and % similarity: A screenshot of your trimmed MUSCLE alignment and a written statement explaining the % similarity between your sample and its closest match.
6. NJ tree: A screenshot of your NJ tree with the properly selected outgroup and YOUR SAMPLE HIGHLIGHTED. Be sure to state which taxon is most closely related to yours.
7. ML tree: A screenshot of your ML tree with the properly selected outgroup and YOUR SAMPLE HIGHLIGHTED. Be sure to state which taxon is most closely related to yours.
8. Google/Wikipedia images of top 3 BLAST hits: You can do fewer than three if there aren't 3 obviously different hits.
9. Organism ID: What is the most accurate taxonomic identity of your organism?
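
When writing the % similarity statement for item 5, it may help to see how such a number can be worked out from two rows of a trimmed alignment. The Python sketch below is one simple way to do it, under a stated assumption: columns where either sequence has a gap (-) are skipped. This handout does not say exactly how DNA Subway treats gaps, so its values may differ slightly. The function name and sequences are made up for illustration.

```python
def percent_similarity(seq_a: str, seq_b: str) -> float:
    """Percent of compared columns where two aligned sequences match.

    Assumption for this sketch: both rows come from the same trimmed
    alignment (equal length), and any column containing a gap ('-')
    is skipped rather than counted as a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must be the same length")

    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns (assumed rule, see docstring)
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Made-up 10 bp alignment rows that differ at a single position:
# 9 of 10 columns match, so the similarity is 90%.
print(percent_similarity("ATGCATGCAT", "ATGCTTGCAT"))
```

Over a 700 bp barcode, a value of 99% would correspond to about 7 mismatched bases.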