251 Lab 2 Christine

Bio/CS-251
Laboratory 2
Examining A Single DNA Sequence
Jan 31, 2007
Christine Frielle
January 31, 2007
For this first lab using web-based bioinformatics tools, we will follow
the last part of Chapter 5 of Bioinformatics for Dummies, henceforth abbreviated as BFD.
The first part of the chapter deals with “cleaning up” a DNA sequence that a
microbiologist may have collected in the lab, and with designing PCR primers. We
will discuss the latter topic at a future date as we approach our “wet lab” exercise. For
now, we will “borrow” a known sequence from the NCBI web page:
http://www.ncbi.nlm.nih.gov/
The gene that we have chosen is the mutS/hMSH2 DNA repair gene. In addition to
following the readings and guided steps on pages 151 – 159, we will ask you to answer
some questions related to your findings.
First, we give some background on this gene.
mutS is the name given to a prokaryotic (bacterial) defender of the genome. (“mut” is an
abbreviation that reflects the increased rate at which DNA mutations accumulate in cells
that do not have this critical gene.)
This gene is universal in that it is found in virtually every organism, both prokaryotic and
eukaryotic.
MSH2 is the name given to the eukaryotic (algae and fungi, plants and animals) version
of this gene (“MSH” is an abbreviation that means “MutS Homolog”). The term
“homolog” means that MSH2 looks and acts like the mutS gene, i.e., its structure is
similar to that of mutS and it plays a similar role in preventing mutations from occurring.
hMSH2: the prefix h in front of the gene name indicates that it is the human version of the
gene.
Before we begin the lab, read Analyzing DNA Composition on page 151 of BFD and
answer these questions.
Q1:
We are analyzing a single sequence of DNA that represents the entire sample of
DNA that was obtained in the hypothetical lab. This sequence obviously
represents only one strand of the DNA that was extracted in the lab. The single
strand of DNA is denoted as cDNA (complementary DNA). How is cDNA
created? HINT: read the preface to Chapter 5 or GOOGLE cDNA.
cDNA stands for complementary DNA. It is a single-stranded DNA copy of messenger
RNA, synthesized from the mRNA template by the enzyme reverse transcriptase.
Q2:
Why is the pairing between guanosine and cytosine nucleotides more stable than
the pairing between adenosine and thymidine?
The G-C pairings are more stable because they are connected by three hydrogen
bonds. The A-T pairings are only connected by two hydrogen bonds.
Q3: If we know the G+C count, can we find the frequency of all of the bases in the
sample of DNA that was obtained in the lab? How is this done?
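Pairings can be either G-C or A-T. If the frequency of G-C pairings is known, the
remainder of the sample must be composed of A-T pairings. Because G pairs with C and
A pairs with T in double-stranded DNA, the G and C frequencies are each half of the
G+C fraction, and the A and T frequencies are each half of the remainder. Computer
programs such as EMBOSS can also be used to count the bases directly.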
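The arithmetic behind Q3 can be sketched as follows. This is a minimal illustration that assumes double-stranded DNA in which Chargaff's rules hold (G = C and A = T); the function name base_frequencies_from_gc and the example value 0.40 are hypothetical and do not come from BFD or EMBOSS.

```python
def base_frequencies_from_gc(gc_fraction):
    """Return per-base frequencies for double-stranded DNA.

    Assumption: Chargaff's rules hold, so G = C and A = T.
    """
    if not 0.0 <= gc_fraction <= 1.0:
        raise ValueError("gc_fraction must be between 0 and 1")
    at_fraction = 1.0 - gc_fraction
    return {
        "G": gc_fraction / 2,
        "C": gc_fraction / 2,
        "A": at_fraction / 2,
        "T": at_fraction / 2,
    }


# Hypothetical example: a sample with 40% G+C content.
freqs = base_frequencies_from_gc(0.40)
for base in "ACGT":
    print(base, round(freqs[base], 3))
```

For a single strand the G and C counts need not be equal, so this shortcut applies only to the double-stranded sample as a whole.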
OK, now on to the lab procedures
Procedure: Collect your sequence from NCBI
Go to the NCBI web site for GenBank given in the URL at the top of this page.
a. From the “Search” pull down menu, choose “Gene”
b. In the “For” window type “hMSH2” and click “Go”
c. Several references to the human versions of this gene are listed. Choose the
second entry, MSH2. Click on this entry
d. You will be taken to a page that contains a variety of information about
research that has been done on this gene. Peruse this page.
Q4: What is the complete name of this entry?
MSH2 mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli) from
Homo sapiens
Q5: How many papers have been written about this particular entry? HINT:
You will need to go to PubMed for this information. Follow the Links!
168 papers have been written about this entry.
Q6: As you scroll down the page you will come to a link to the GenBank page
that contains the DNA sequence itself. How many base pairs long is the
sequence for this entry?
80098 base pairs
e. Scroll back up to the top of the GenBank page and from the Display pulldown
menu choose FASTA.
f. A new page will appear that contains the name of the entry and the listing of
the nucleotides in sequential order, but in a different format from the one at
the bottom of the previous page. Copy all of this information into a Word
document that you will save in your workspace as MSH2.doc. You are now
ready to begin your analysis.
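For readers who want to work with the copied FASTA text outside of a web tool, the format can be read with a few lines of code. This is a minimal sketch that assumes a single record (one header line starting with ">" followed by sequence lines); the function name parse_fasta and the toy record are hypothetical, not the actual MSH2 entry.

```python
def parse_fasta(text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must begin with a '>' header line")
    header = lines[0][1:]
    # Join the wrapped sequence lines into one uppercase string.
    sequence = "".join(lines[1:]).upper()
    return header, sequence


# Toy record for illustration only.
example = ">toy_record hypothetical example\nACGTAC\nGGTA\n"
header, sequence = parse_fasta(example)
print(header)
print(sequence, len(sequence))
```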
Procedure: Follow page 152 in BFD – Counting Words in DNA Sequences
Purpose: to find the count for each of the nucleotides found in this sequence and also to
find the count for each of four significant triplets found in the sequence.
g. After you obtain your result, which will be formatted like Figure 5-4, copy and
paste it onto a new page of your MSH2.doc.
Screen shot attached at end.
Q7:
What is the total G+C count for this sequence? Why are the
percentages of G and C that are shown so different? Is this a
violation of Chargaff’s rules?
The G+C content is 41.60%. The percentage for G is 21.33% and the
percentage for C is 20.27%. This is not a violation of Chargaff’s rules
because the sequence is a single-stranded cDNA sequence. If the
sequence were a double-stranded DNA segment, it would be expected
that the percentages would be equal.
Q8:
Give the total count for each of the nucleotides in the strand of
DNA represented by this sequence.
A: 20890
C: 16235
G: 17088
T: 25885
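The percentages reported in Q7 can be checked directly from these four counts. The short sketch below uses only the numbers recorded above; the variable names are illustrative.

```python
# Nucleotide counts as recorded in the answer to Q8.
counts = {"A": 20890, "C": 16235, "G": 17088, "T": 25885}

total = sum(counts.values())
percent = {base: 100 * n / total for base, n in counts.items()}
gc_percent = percent["G"] + percent["C"]

print("total bases:", total)
for base in "ACGT":
    print(base, round(percent[base], 2))
print("G+C:", round(gc_percent, 2))
```

The total agrees with the sequence length given in Q6, and the G, C, and G+C percentages agree with the values given in Q7.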
Q9:
As you will learn, the triplets ATG, TAA, TAG, and TGA can
have a special significance in DNA sequences. What is the
frequency of each of these triplets in the sequence that you just
processed?
ATG: 1344
TAA: 1551
TAG: 1173
TGA: 1474
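A triplet count of this kind can be sketched with a sliding window. This example assumes that overlapping occurrences on the given strand are counted; whether the web tool counts in exactly this way is an assumption. The function name count_overlapping and the toy sequence are hypothetical.

```python
def count_overlapping(sequence, word):
    """Count occurrences of word in sequence, allowing overlaps."""
    sequence = sequence.upper()
    word = word.upper()
    k = len(word)
    return sum(
        1
        for i in range(len(sequence) - k + 1)
        if sequence[i:i + k] == word
    )


# Toy sequence for illustration only (not the MSH2 sequence).
toy = "ATGATGTAATGA"
triplet_counts = {t: count_overlapping(toy, t) for t in ("ATG", "TAA", "TAG", "TGA")}
print(triplet_counts)
```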
Procedure: Follow the instructions on pages 153 – 154 of BFD
Purpose: To search our sequence for the occurrence of any highly unusual repeat of a
long word (> 3 nucleotides in length)
The people who did the statistical analysis for the program BLAST (which we will begin
using next week) concluded that a sequence of length 11 is very unlikely to be repeated
solely by random assignment of the four letters A, C, G, and T. Therefore, we may
conclude that the repeat of an 11-letter word is a significant finding in our sequence.
We will look for a repeated sequence, but not push it as far as 11. We will go with 5.
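A rough feel for these word lengths can be had from a simple calculation. The sketch below assumes every base is drawn independently with probability 1/4; this is an illustrative simplification, not the statistical model that BLAST uses. The function name expected_occurrences is hypothetical, and 80098 is the MSH2 sequence length from Q6.

```python
def expected_occurrences(sequence_length, word_length, alphabet_size=4):
    """Expected count of one specific word in a uniformly random sequence."""
    positions = sequence_length - word_length + 1
    return positions / alphabet_size ** word_length


for k in (5, 11):
    print(k, 4 ** k, round(expected_occurrences(80098, k), 4))
```

Under this assumption a particular 5-letter word is expected dozens of times in a sequence of this length, while a particular 11-letter word is expected far less than once, which is why a repeated 11-letter word stands out.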
h. Follow the instructions on pages 153 – 154 using a word length of 5. You
will have to recopy the sequence for MSH2 that you saved in your Word
document. NOTE: In instruction 3 there is no link at “Codon usage,
composition”. Just find that section on the web page and go to instruction 4
on page 153.
Q10: How many 5-letter words are repeated 200 times or more in the
sequence for MSH2?
There are 32 five-letter words that are repeated 200 times or more in the
sequence.
Q12: List (Copy and Paste) these sequence(s).
Word    Count
TTTTT   1270
AAAAA   589
ATTTT   482
TTTTA   415
TTTTG   346
TATTT   343
TTTGT   289
TTTAA   283
TTTCT   277
TTTAT   273
AATTT   270
TGTTT   269
TTATT   266
CTTTT   266
GTTTT   255
GCTGG   245
AAAAT   245
TTTTC   242
TCTTT   242
TTCTT   240
CCTCC   235
TTGTT   235
TAATT   234
TTAAA   226
TAAAA   226
CTGGG   217
AAATT   214
GCCTC   212
TGGGA   205
TATAT   205
TTTGA   202
TCCCA   201
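A word count like the one listed above can be sketched by counting every overlapping word of a fixed length and keeping those at or above a threshold. This is a minimal illustration, not the web tool's implementation; the function names count_words and frequent_words and the toy sequence are hypothetical.

```python
from collections import Counter


def count_words(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def frequent_words(sequence, k, min_count):
    """Return (word, count) pairs with count >= min_count, most common first."""
    counts = count_words(sequence, k)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Toy sequence for illustration only.
toy = "TTTTTTTAAAAA"
result = frequent_words(toy, 5, 2)
print(result)
```

For the lab itself, the call would use the full MSH2 sequence with k set to 5 and min_count set to 200.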
Procedure:
Using a Dot-Plot to spot long words in a sequence.
Purpose:
To provide a streamlined visual method to perform the task of the previous
procedure.
i. Follow the instructions on pages 155 and 156 of BFD. The web page will not
download with your graph, so scroll up so that the entire graph appears on your
screen. Then press ALT and Print Scrn at the same time. This will copy the
window that displays the graph. Paste (Ctrl and V) this on a new page of
your WORD document. Save this document in a folder called Lab 1 on your
H drive. You should also save this completed Lab worksheet in that folder.
Sheet views are attached with different window sizes.
Q13:
Does this dot plot show any repeated word of significant length?
Think carefully before you answer this question.
The dark areas on the dot plot show words that are repeated a significant
number of times. The darker the area, the more times the word is repeated.
This makes sense because, earlier, there were found to be 32 five-letter words
that were repeated 200 times or more.
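The idea behind a dot plot can be sketched in a few lines. This is a simplified illustration that compares a sequence against itself and marks every position pair where the words of a chosen size match exactly; it does not reproduce the tool used in BFD. The function names dot_plot and render and the toy sequence are hypothetical.

```python
def dot_plot(seq_a, seq_b, word_size):
    """Return a grid of booleans: True where words of word_size match exactly."""
    seq_a = seq_a.upper()
    seq_b = seq_b.upper()
    rows = len(seq_a) - word_size + 1
    cols = len(seq_b) - word_size + 1
    return [
        [seq_a[i:i + word_size] == seq_b[j:j + word_size] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix):
    """Draw the grid as text: '*' for a match, '.' for no match."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


# Toy sequence for illustration only: ACGT repeated twice.
toy = "ACGTACGT"
plot = dot_plot(toy, toy, 3)
print(render(plot))
```

When a sequence is compared with itself, the main diagonal is always filled. Marks away from the diagonal indicate repeated words, and a run of such marks parallel to the diagonal indicates a longer repeat.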
An example of a repeated sequence with tragic consequences
Procedure: Using OMIM (Online Mendelian Inheritance in Man) to examine
a genetic disease caused by repeat sequences
Purpose:
Learn how to navigate OMIM
j. Go to http://www.ncbi.nlm.nih.gov/. Under “Search”, choose “Gene”,
and type “HD” into the search box. Open Link #2, “HD”. Read the
Summary, and then scroll to the bottom of the page. Under NCBI
Reference Sequences (RefSeq), open the link to the mRNA
sequence (NM_002111), then under “Display”, choose FASTA.
k. Examine the first six lines of the mRNA, and in the space below, record
a triplet sequence that is repeated in tandem more than 10 times:
Q14: Record your triplet repeat here:
The triplet CAG is repeated, in tandem, more than 10 times.
Q15: How many times is the triplet repeated (how many copies of the
triplet?)
The triplet is repeated 21 times.
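Counting a tandem repeat by eye is error-prone, so a small check can help. The sketch below finds the longest uninterrupted run of a repeat unit using a regular expression; the function name longest_tandem_run and the toy sequence are hypothetical, and the toy sequence is not the HD mRNA.

```python
import re


def longest_tandem_run(sequence, unit):
    """Return the largest number of back-to-back copies of unit in sequence."""
    pattern = re.compile("(?:%s)+" % re.escape(unit.upper()))
    best = 0
    for match in pattern.finditer(sequence.upper()):
        best = max(best, len(match.group(0)) // len(unit))
    return best


# Toy sequence for illustration only: four tandem CAG copies, then CAA, then one CAG.
toy = "ATGGCG" + "CAG" * 4 + "CAACAG"
print(longest_tandem_run(toy, "CAG"))
```

For the lab, the sequence argument would be the FASTA sequence of NM_002111 with the header line removed.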
l. Return to the NCBI Entrez Gene page for the HD gene. Under
“Additional Links”, select MIM:143100, and open this link to the OMIM
database for the HD gene. You will find that this is a long and detailed
summary of everything that is known about the HD gene and its
pathology. Answer each of the following questions briefly.
Q16: What disease is caused by alterations in the HD gene?
What organ system is affected by this disease?
(You may wish to view the “Clinical Synopsis” from the Table of
Contents along the left border of the page)
Huntington disease.
It affects the central nervous system,
with behavioral and psychiatric manifestations as well.
Q17: From the Table of Contents, select “Allelic Variants”, read this
section, and answer the following question:
What is the molecular genetic basis for the disease? Explain
how repeat sequence variation is responsible for this disease.
The nucleotide sequence CAG is located in the coding region of the gene for
Huntington disease. The sequence is repeated between 9 and 37 times in
normal individuals, but between 37 and 86 times in affected individuals.
Because the sequence is repeated in a coding section of the gene, the protein
will contain extra amino acids (glutamine, which CAG encodes) that would not
normally be present. These extra amino acids affect the protein and its
functions in the cells.
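The effect of extra CAG copies on the protein can be illustrated with a toy translation. This sketch uses a partial codon table containing only the codons in the example (ATG is methionine, CAG and CAA are glutamine, CCG is proline); the sequences and the function name translate are hypothetical and are not taken from the HD gene.

```python
# Partial codon table: only the codons used in this toy example.
CODON_TO_AMINO_ACID = {"ATG": "M", "CAG": "Q", "CAA": "Q", "CCG": "P"}


def translate(coding_sequence):
    """Translate a coding sequence codon by codon using the partial table."""
    coding_sequence = coding_sequence.upper()
    if len(coding_sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(
        CODON_TO_AMINO_ACID[coding_sequence[i:i + 3]]
        for i in range(0, len(coding_sequence), 3)
    )


# Toy sequences: the second carries three extra CAG copies.
short_repeat = "ATG" + "CAG" * 3 + "CCG"
long_repeat = "ATG" + "CAG" * 6 + "CCG"
print(translate(short_repeat))
print(translate(long_repeat))
```

Each added CAG adds one glutamine (Q) to the protein, so a longer repeat gives a longer run of glutamines.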
I affirm that I have upheld the highest principles of honesty and integrity in
my academic work and have not witnessed a violation of the Honor Code.
Christine Frielle