Bio/CS-251 Laboratory 2
Examining A Single DNA Sequence
Jan 31, 2007

For this first lab using the bioinformatics tools found on the web, we will follow the last part of Chapter 5 of Bioinformatics for Dummies, henceforth abbreviated as BFD. The first part of the chapter deals with “cleaning up” a sequence of DNA that a microbiologist may have collected in the lab, and also with designing PCR primers. We will discuss this latter topic at a future date as we approach our “wet lab” exercise. For now, we will “borrow” a known data sequence from the NCBI web page:

http://www.ncbi.nlm.nih.gov/

The gene that we choose is the mutS/hMSH2 DNA repair gene. In addition to following the readings and guided steps on pages 138-, we will ask you to answer some questions related to your findings. First, we give some background on this gene.

mutS is the name given to a prokaryotic (bacterial) defender of the genome. (“mut” is an abbreviation that reflects the increased rate at which DNA mutations accumulate in cells that do not have this critical gene.) This gene is universal in that it is found in virtually every organism, both prokaryotic and eukaryotic.

MSH2 is the name given to the eukaryotic (algae and fungi, plants and animals) version of this gene. (“MSH” is an abbreviation that means “MutS Homolog”.) The term “homolog” means that MSH2 looks and acts like the mutS gene, i.e., its structure is similar to mutS and it plays a similar role in preventing mutations from occurring.

hMSH2: the prefix h in front of the gene name indicates that it is the human version of the gene.

Before we begin the lab, read Analyzing DNA Composition on page 138 of BFD and answer these questions.

Q1: We are analyzing a single sequence of DNA that represents the entire sample of DNA that was obtained in the hypothetical lab. This sequence obviously represents only one strand of the DNA that was extracted in the lab. The single strand of DNA is denoted as cDNA (complementary DNA). How is cDNA created?
HINT: Read the preface to Chapter 5 or Google cDNA.

cDNA is created by using reverse transcriptase. DNA is transcribed into mRNA, which is matured by adding a poly-A tail; reverse transcriptase can then make a strand complementary to the RNA. This newly made strand is complementary DNA (cDNA).

Q2: Why is the pairing between guanosine and cytosine nucleotides more stable than the pairing between adenosine and thymidine?

The pairing between guanosine and cytosine is more stable because they form three hydrogen bonds with each other, whereas adenosine and thymidine form only two.

Q3: If we know the G+C count, can we find the frequency of all of the bases in the sample of DNA that was obtained in the lab? How is this done?

Yes. G is always paired with C, and A is always paired with T, so if you know the percentage of one pair, the other pair must make up the rest of the total. For example, if G+C make up 60%, then A+T must make up 40%. Each base is then half of its pair's percentage, because the two bases in a pair occur in equal numbers: G=30%, C=30%, A=20%, T=20%.

OK, now on to the lab procedures.

Procedure: Collect your sequence from NCBI

Go to the NCBI web site for GenBank given in the URL at the top of this page.

a. From the “Search” pull-down menu, choose “Gene”.
b. In the “For” window, type “hMSH2” and click “Go”.
c. Several references to the human versions of this gene are listed. Choose the second entry, MSH2. Click on this entry.
d. You will be taken to a page that contains a variety of information about research that has been done on this gene. Peruse this page.

Q4: What is the complete name of this entry?

mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli)

Q5: How many papers have been written about this particular entry? HINT: You will need to go to PubMed for this information. Follow the Links!
168

Q6: As you scroll down the page you will come to a link to the GenBank page that contains the DNA sequence itself. How many base pairs long is the sequence for this entry?

80098

e. Scroll back up to the top of the GenBank page and, from the Display pull-down menu, choose FASTA.
f. A new page will appear that contains the name of the entry and the listing of the nucleotides in sequential order, but in a different format from the one at the bottom of the previous page. Copy all of this information into a Word document that you will save in your workspace as MSH2.doc.

You are now ready to begin your analysis.

Procedure: Follow page 152 in BFD – Counting Words in DNA Sequences

Purpose: To find the count for each of the nucleotides found in this sequence and also to find the count for each of four significant triplets found in the sequence.

g. After you obtain your result, which will be formatted like Figure 5-4, copy and paste it on a new page of your MSH2.doc.

Q7: What is the total G+C count for this sequence? Why are the percentages of G and C that are shown so different? Is this a violation of Chargaff’s rules?

G+C = 41.60%

This seems to be a slight violation of Chargaff’s rule because the G and C contents are not equal. Chargaff’s rule is not exact, though, so the violation is minor, especially since the difference between the two could just be due to experimental error.

Q8: Give the total count for each of the nucleotides in the strand of DNA represented by this sequence.

G=21.33%
C=20.27%
A=26.08%
T=32.32%

Q9: As you will learn, the triplets ATG, TAA, TAG, and TGA can have a special significance in DNA sequences. What is the frequency of each of these triplets in the sequence that you just processed?
ATG=1.68%
TAA=1.94%
TAG=1.46%
TGA=1.84%

Procedure: Follow the instructions on pages 153 – 154 of BFD

Purpose: To search our sequence for the occurrence of any highly unusual repeat of a long word (> 3 nucleotides in length).

The people who did the statistical analysis for the program BLAST (which we will begin using next week) said that the chance of any sequence of length 11 being repeated solely by random assignment of the four letters A, C, G, or T is below any reasonable level of statistical significance. Therefore, we may conclude that the repeat of an 11-letter word is a significant finding in our sequence. We will look for a repeated sequence, but not push it as far as 11. We will go with 5.

h. Follow the instructions on pages 153 – 154 using a word length of 5. You will have to recopy the sequence for MSH2 that you saved in your Word document. NOTE: In instruction 3 there is no link at “Codon usage, composition”. Just find that section on the web page and go to instruction 4 on page 153.

Q10: How many 5-letter words are repeated 200 times or more in the sequence for MSH2?

Q12: List (copy and paste) these sequence(s).

Procedure: Using a Dot-Plot to spot long words in a sequence

Purpose: To provide a streamlined visual method to perform the task of the previous procedure.

i. Follow the instructions on pages 155 and 156 of BFD. The web page will not download with your graph, so scroll up so that the entire graph appears on your screen. Then press ALT and Print Scrn at the same time. This will copy the window that displays the graph. Paste (Ctrl and V) this on a new page of your Word document. Save this document in a folder called Lab 1 on your H drive. You should also save this completed lab worksheet in that folder.

Q13: Does this dot plot show any repeated word of significant length? Think carefully before you answer this question.
Above, it is stated that a repeated word must be 11 letters long in order to be considered statistically significant. This means that the dot plot does not show any words of significant length.

An example of a repeated sequence with tragic consequences

Procedure: Using OMIM (Online Mendelian Inheritance in Man) to examine a genetic disease caused by repeat sequences

Purpose: Learn how to navigate OMIM.

j. Go to http://www.ncbi.nlm.nih.gov/. Under “Search”, choose “Gene”, and type “HD” into the search box. Open Link #2, “HD”. Read the Summary, and then scroll to the bottom of the page. Under NCBI Reference Sequences (RefSeq), open the link to the mRNA sequence (NM_002111), then under “Display”, choose FASTA.
k. Examine the first six lines of the mRNA, and in the space below, record a triplet sequence that is repeated in tandem more than 10 times:

Q14: Record your triplet repeat here:

GCA

Q15: How many times is the triplet repeated (how many copies of the triplet)?

21 times in a row

l. Return to the NCBI Entrez Gene page for the HD gene. Under “Additional Links”, select MIM:143100, and open this link to the OMIM database for the HD gene. You will find that this is a long and detailed summary of everything that is known about the HD gene and its pathology. Answer each of the following questions briefly.

Q16: What disease is caused by alterations in the HD gene? What organ system is affected by this disease? (You may wish to view the “Clinical Synopsis” from the Table of Contents along the left border of the page.)

Huntington Disease; the brain

Q17: From the Table of Contents, select “Allelic Variants”, read this section, and answer the following question: What is the molecular genetic basis for the disease? Explain how repeat sequence variation is responsible for this disease.
CAG is repeated many times within the huntingtin gene, “which translates as a polyglutamine repeat in the protein product.” This causes the brain to slowly degenerate, often inducing psychotic and behavioral symptoms.
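
The tandem-repeat search in step k (Q14 and Q15) can also be checked with a few lines of code rather than by eye. The following is a minimal sketch, not part of the BFD procedure: the function names `read_fasta_sequence` and `longest_tandem_run` are made up for illustration, and the toy sequence is invented (it is not NM_002111). It strips the FASTA header line, then reports the longest uninterrupted run of a chosen triplet.

```python
def read_fasta_sequence(text: str) -> str:
    """Join the sequence lines of one FASTA record, skipping '>' header lines."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()


def longest_tandem_run(sequence: str, unit: str) -> tuple[int, int]:
    """Return (copies, start_index) for the longest back-to-back run of `unit`.

    Every start position is tried, so runs in any reading frame are found.
    Returns (0, -1) if `unit` never occurs.
    """
    if not unit:
        raise ValueError("unit must be a non-empty string")
    seq, unit = sequence.upper(), unit.upper()
    k = len(unit)
    best_copies, best_start = 0, -1
    for i in range(len(seq) - k + 1):
        copies, j = 0, i
        # Count how many copies of the unit follow each other starting at i.
        while seq.startswith(unit, j):
            copies += 1
            j += k
        if copies > best_copies:
            best_copies, best_start = copies, i
    return best_copies, best_start


# Invented toy record for illustration only (NOT the real HD mRNA).
toy_fasta = """>toy_sequence hypothetical example
ATGGCGACC
CAGCAGCAGCAG
CAGCAACCG
"""

seq = read_fasta_sequence(toy_fasta)
print(longest_tandem_run(seq, "CAG"))  # (5, 9): five copies starting at index 9
```

To use it on the real data, paste the FASTA text copied in step j in place of `toy_fasta`. Note that the same repeat tract can be read in three frames (CAG, AGC, or GCA), so the copy count reported for GCA may differ by one from the count reported for CAG.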