Name_____________________

Investigation: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Using bioinformatics as a tool to determine evolutionary relationships.

Between 1990 and 2003, scientists working on an international research project known as the Human Genome Project were able to identify and map the 20,000–25,000 genes that define a human being. The project also successfully mapped the genomes of other species, including the fruit fly, mouse, and Escherichia coli. The location and complete sequence of the genes in each of these species are available for anyone in the world to access via the Internet.

Why is this information important? Being able to identify the precise location and sequence of human genes will allow us to better understand genetic diseases. In addition, learning about the sequence of genes in other species helps us understand evolutionary relationships among organisms. Many of our genes are identical or similar to those found in other species.

Suppose you identify a single gene that is responsible for a particular disease in fruit flies. Is that same gene found in humans? Does it cause a similar disease? It would take you nearly 10 years to read through the entire human genome to try to locate the same sequence of bases as that in fruit flies. This definitely isn’t practical, so a sophisticated technological method is needed.

Bioinformatics is a field that combines statistics, mathematical modeling, and computer science to analyze biological data. Using bioinformatics methods, entire genomes can be quickly compared in order to detect genetic similarities and differences. An extremely powerful bioinformatics tool is BLAST, which stands for Basic Local Alignment Search Tool. Using BLAST, you can input a gene sequence of interest and search entire genomic libraries for identical or similar sequences in a matter of seconds.
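To make the idea of a "local alignment search" more concrete, here is a minimal Python sketch that slides a short query along a longer sequence and reports the window with the most matching bases. This is only an illustration of the concept: BLAST itself uses a much faster heuristic and a statistical scoring scheme, and the function name and the two toy sequences below are made up for this example.

```python
def best_ungapped_match(query: str, subject: str) -> tuple[int, int]:
    """Slide `query` along `subject` without gaps.

    Returns (offset, matches) for the window with the most identical bases.
    Illustrative only: this brute-force scan is NOT how BLAST works internally.
    """
    if len(query) > len(subject):
        raise ValueError("query must not be longer than subject")
    best_offset, best_matches = 0, -1
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        # Count positions where the query base equals the subject base.
        matches = sum(q == s for q, s in zip(query, window))
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


# Toy sequences (hypothetical, not real genes).
query = "GATTACA"
subject = "CCGTGATTGCATTT"
offset, matches = best_ungapped_match(query, subject)
print(f"best window starts at {offset}: {matches}/{len(query)} bases match")
```

For these toy sequences the best window starts at offset 4, where 6 of the 7 query bases match. Scanning a whole genome this way would be far too slow, which is why a specialized tool is needed.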
In this laboratory investigation, you will use BLAST to compare several genes, and then use the information to construct a cladogram. A cladogram (also called a phylogenetic tree) is a visualization of the evolutionary relatedness of species. A cladogram is treelike, with the endpoint of each branch representing a specific species. The closer two species are located to each other, the more recently they shared a common ancestor.

Cladograms can also include additional details, such as the evolution of particular physical structures called shared derived characters. The placement of a derived character corresponds to when (in a general, not a specific, sense) that character evolved; every species above the character label possesses that structure.

Historically, only physical structures were used to create cladograms; however, modern-day cladistics relies heavily on genetic evidence as well. For example, chimpanzees and humans share 95%+ of their DNA, which would place them close together on a cladogram. Humans and fruit flies share approximately 60% of their DNA, which would place them farther apart on a cladogram.

PRE-LAB:

1) Use the following data to construct a cladogram of the major plant groups (“1” = characteristic is present within group):

Organisms | Mosses | Pine trees | Flowering plants | Ferns
Vascular Tissue | 0 | 1 | 1 | 1
Flowers | 0 | 0 | 1 | 0
Seeds | 0 | 1 | 1 | 0

2) GAPDH (glyceraldehyde 3-phosphate dehydrogenase) is an enzyme that catalyzes the sixth step in glycolysis, an important reaction that produces molecules used in cellular respiration. The following data table shows the percentage similarity of this gene, and of the protein it encodes, in humans versus other species. For example, according to the table, the GAPDH gene in chimpanzees is 99.6% identical to the gene found in humans, while the protein is identical.
Species | Gene Percentage Similarity | Protein Percentage Similarity
Chimpanzee (Pan troglodytes) | 99.6% | 100%
Dog (Canis lupus familiaris) | 91.3% | 95.2%
Fruit fly (Drosophila melanogaster) | 72.4% | 76.7%
Roundworm (Caenorhabditis elegans) | 68.2% | 74.3%

a) Why is the percentage similarity in the gene always lower than the percentage similarity in the protein for each species? (Hint: Recall how a gene is expressed to produce a protein.)

b) Draw a cladogram depicting the evolutionary relationships among all five species (including humans) according to their percentage similarity in the GAPDH gene.

PROCEDURE (Part One):

A team of scientists has uncovered a fossil specimen near Liaoning Province, China. Make some general observations about the morphology (physical structure) of the fossil, and then record your observations in the space below.

Little is known about the fossil. It appears to be a new species. Upon careful examination of the fossil, small amounts of soft tissue have been discovered. Normally, soft tissue does not survive fossilization; however, rare cases of such preservation do occur. Scientists were able to extract DNA from the tissue and use the information to sequence several genes. Your task is to use BLAST to analyze these genes and determine the most likely placement of the fossil species on the cladogram below.

Step 1: Make a hypothesis as to where you believe the fossil specimen should be placed on the cladogram based on the morphological observations that you made of the fossil. Mark (and label) your hypothesis on the cladogram above.
OBSERVATIONS:
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________
_____________________________________________________________________________________

Step 2: Download the sequences of the four gene samples taken from the unknown fossil. These can be found on my website under the ‘Labs & Lab Notebook’ link.

Step 3: Use the following website for your genetic analysis:

BLAST: use this website to compare gene sequences with genomic DNA from representative organisms in a database.
http://blast.ncbi.nlm.nih.gov/Blast.cgi

Step 4: Go to the BLAST website. Under ‘Basic BLAST,’ click on ‘nucleotide blast.’ Copy and paste the gene sequence for FOSSIL GENE 1 into the ‘Enter Query Sequence’ box on the BLAST webpage. Under ‘Choose Search Set,’ in the database options, make sure that “others (nr etc.)” is selected. Under ‘Program Selection,’ make sure that “Highly similar sequences (megablast)” is selected. Then, click the blue “BLAST” button to search for gene sequences in different species that are similar to the unknown fossil gene sequence.

Step 5: When the results of the BLAST sequence comparison appear, scroll down to the section entitled ‘sequences producing significant alignments.’ The species listed below this heading are those with sequences identical to (or most similar to) the gene of interest. The most similar sequences are listed first. You’ll need to click on the particular species listed and the ‘accession’ link, where you’ll find more information, including the common name of the species and the number of nucleotides that match between the gene of interest and the known organism. Using the information from your results, complete TABLE 1.

REPEAT STEPS 4 & 5 FOR ALL FOUR FOSSIL GENES.
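The “number of matching nucleotides” and “% nucleotide match” that you record in TABLE 1 are related by simple arithmetic: the percentage is the number of matching positions divided by the length of the alignment. The Python sketch below shows that calculation for two already-aligned sequences, with “-” marking a gap. The function name and the two short sequences are hypothetical and only stand in for what you will see on the BLAST results page.

```python
def identity_stats(aligned_query: str, aligned_subject: str) -> tuple[int, int, float]:
    """Return (matches, alignment_length, percent_identity) for two aligned sequences.

    Both strings must already be aligned (same length, '-' for gaps).
    A gap never counts as a match.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    length = len(aligned_query)
    if length == 0:
        raise ValueError("alignment must not be empty")
    matches = sum(
        a == b and a != "-" for a, b in zip(aligned_query, aligned_subject)
    )
    return matches, length, 100.0 * matches / length


# Hypothetical aligned fragments (not real fossil data).
matches, length, percent = identity_stats("ATGCC-GTA", "ATGACTGTA")
print(f"{matches} / {length} nucleotides match ({percent:.1f}%)")
```

Here 7 of the 9 aligned positions match, which works out to about 77.8%. The “/” in the TABLE 1 column is there so you can record both numbers in the same way.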
TABLE 1

Fossil Gene # | Most closely related organism (genus and species name) | Most closely related species (common name) | “Max Score” | Number of matching nucleotides (fossil gene vs. organism gene) | % nucleotide match (“Max Identity”) | Next TWO most closely related organisms (common names only)
1 | | | | / | |
2 | | | | / | |
3 | | | | / | |
4 | | | | / | |

Step 6: Based on what you’ve learned from the sequence analysis and what you know from the fossil structure itself, decide where the new fossil species belongs on the cladogram. Mark (and label) your results on the cladogram so that you can compare your results with your hypothesis.

PROCEDURE (Part Two):

Now that you’ve completed Part One of the investigation, you should feel more comfortable using BLAST. The next step is to learn how to find and BLAST your own genes of interest. To locate a gene, you will go to the following website:

NCBI Gene: use this website to obtain gene sequences for analysis.
http://www.ncbi.nlm.nih.gov/gene

Step 1: Use the search tool at the top of this website to search for the sequences listed in TABLE 2.

Step 2: Click on the first link that appears and scroll down to the “NCBI Reference Sequences.” Under “mRNA and Proteins,” click on the first file name. It will be named “NM_000257.2” or something similar.

Step 3: Just below the gene title, click on “FASTA.” This is the name of a particular format for displaying sequences.

Step 4: Copy the entire gene sequence, and then go to the BLAST website (see Procedure (Part One), Step 3).

Step 5: Under ‘Basic BLAST,’ click on ‘nucleotide blast.’ Paste your gene sequence into the ‘Enter Query Sequence’ box on the BLAST webpage. Under ‘Choose Search Set,’ make sure that “others (nr etc.)” is selected. Under ‘Program Selection,’ make sure that “Highly similar sequences (megablast)” is selected. Then, click the blue “BLAST” button to search for gene sequences in different species that are similar to the human gene sequence of interest.
Step 6: When the results of the BLAST sequence comparison appear, scroll down to the section entitled ‘sequences producing significant alignments.’ The species listed below this heading are those with sequences identical to (or most similar to) the human gene of interest. The most similar sequences are listed first; a higher “Max Score” usually indicates a closer genetic relationship.

***For TABLE 2, exclude all Homo sapiens (human) DNA sequence matches. From your BLAST results list, choose the DNA sequence from an organism other than Homo sapiens that most closely matches the sequence of your human gene of interest. For example, Pan troglodytes is the scientific name of the common chimpanzee, where Pan is the genus name and troglodytes is the species name.***

Remember to click on the particular species listed and the ‘accession’ link, where you’ll find more information, including the common name of the species and the number of nucleotides that match between the gene of interest and the known organism. Using the information from your results, complete TABLE 2.

REPEAT STEPS 1-6 FOR ALL FOUR PROVIDED HUMAN GENES OF INTEREST.

Step 7: Think of a human protein NOT listed in the table, search for the gene sequence of this protein, run a BLAST comparison on that sequence, and list all results in the final row of TABLE 2.

TABLE 2

Human Gene | Most closely related organism (genus and species name) | Most closely related species (common name) | “Max Score” | Number of matching nucleotides (human gene vs. organism gene) | % nucleotide match (“Max Identity”) | Next TWO most closely related organisms (common names only)
Human Estrogen Receptor | | | | / | |
Human Keratin 18 | | | | / | |
Human Catalase | | | | / | |
Human Myosin 7 (cardiac) | | | | / | |
Human _______ _______ | | | | / | |

Analysis

1) Using your results from TABLE 2, sketch a hypothetical cladogram based upon gene sequence matches. Your cladogram should include humans and ALL animals listed within TABLE 2.
2) What is the function in humans of each of the proteins produced from the genes in TABLE 2?

GENE | PROTEIN FUNCTION
Human Estrogen Receptor |
Human Keratin 18 |
Human Catalase |
Human Myosin 7 (cardiac) |
Human ____________ ____________ |

3) Is it possible to find the same gene in two different kinds of organisms but not find the protein that is produced by that gene in both organisms? Why or why not?

4) If you found the same gene in all organisms you test, what does this suggest about the evolution of this gene in the history of life on Earth?

5) Does the use of DNA sequences in the study of evolutionary relationships mean that other characteristics are unimportant in such studies? Explain your answer.
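For reference, the FASTA format mentioned in Procedure (Part Two), Step 3 is plain text: each record starts with a description line beginning with “>”, followed by one or more lines of sequence. The short Python sketch below reads FASTA text into a dictionary. The function name, the record label, and the bases are made up for illustration and are not real NCBI data.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {description: sequence}.

    Lines starting with '>' begin a new record; all following lines
    up to the next '>' are joined into one sequence string.
    """
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Hypothetical record: the label and bases are invented for this example.
example = ">example_record made-up sequence\nATGGCCATTG\nTAATGGGCCG\n"
for description, sequence in parse_fasta(example).items():
    print(description, len(sequence), "bases")
```

When you copy a sequence for BLAST, you can paste the whole FASTA record, description line included, into the query box.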