Dabbling in bioinformatics:

CS 251 Introduction to Bioinformatics: Laboratory 2:
Dabbling in Bioinformatics:
Today, we will take our first real crack at using bioinformatic tools.
We will follow the flow of Chapter 3, Bioinformatics for Dummies (BFD), in which the authors pilot a
single gene (dUTPase, deoxyuridine 5’ triphosphate nucleotidylhydrolase) as a vehicle for touring
several genome databases and for learning some basic terminology and search tools.
To make this exercise more interesting for you, we will substitute a gene of our choosing, the
mutS/hMSH2 DNA repair gene. We will also ask you to perform some additional
steps (e.g., blastp) and answer a variety of questions as you navigate this “road rally” through
databases, genomes, and search tools.
First, the essential gene terminology:
mutS is the name given to the prokaryotic (bacterial) version of this universal defender of the genome.
(“mut” is an abbreviation reflecting the increased rate at which DNA mutations accumulate in cells that
lack this critical gene.)
MSH2 is the name given to the eukaryotic (algae and fungi, plants, and animals) version of this gene.
(“MSH” is an abbreviation that means “MutS Homolog”). The term “homolog” means that the MSH2
gene looks and acts like the mutS gene, i.e., its structure (DNA and protein sequence) is similar to
mutS, and it plays a similar role in preventing mutations from occurring.
hMSH2: the prefix ‘h’ in front of a gene name indicates that it is the human version of the gene.
For some background, please obtain the PubMed abstracts of these two recent research articles
about the mutS/hMSH2 genes.
Ainsworth P, Koscinski D, Fraser B, Stuart J.
Family cancer histories predictive of a high risk of hereditary non-polyposis colorectal cancer
associate significantly with a genomic rearrangement in hMSH2 or hMLH1.
Clin Genet. 2004 Sep;66(3):183-188.
PMID: 15324316 [PubMed - as supplied by publisher]
Watson ME Jr, Burns JL, Smith AL.
Hypermutable Haemophilus influenzae with mutations in mutS are found in cystic fibrosis sputum.
Microbiology. 2004 Sep;150(Pt 9):2947-58.
PMID: 15347753 [PubMed - in process]
Please answer the following questions here:
From the abstract by Ainsworth P, Koscinski D, Fraser B, Stuart J.: HNPCC is a hereditary form
of colon cancer caused by defects in DNA repair genes, most notably the hMSH2 gene. About 1 in
200 of us will develop this cancer because we carry a defective copy of the hMSH2 gene. Are there
any bioinformatic tools, described in this paper, for predicting risk for this defect in human
populations? What is the name of this tool, and its location? At what institution was this tool
developed and housed?
From the abstract by Watson ME Jr, Burns JL, Smith AL: normally, bacteria lacking the mutS
gene are at a distinct disadvantage owing to the rapid accumulation of deleterious mutations in their
DNA. Why might this defect in DNA repair provide an advantage for human bacterial pathogens?
Procedure:
Follow pp. 78-84 in BFD.
Objective: Locate and study the E. coli mutS gene
Go to the GenBank entry tool at http://www.ncbi.nlm.nih.gov/entrez/
a. From the “Search” pull down menu, choose “Gene”.
b. Type the term ‘mutS E. coli’ in the “For” window and click “Go”.
c. Entries for a number of human versions of this gene are listed. However, nowhere on this list
will you find the E. coli mutS gene (strangely?!). Instead, scroll down the page until you find
the 14th entry. This will provide you with annotation for the mutS protein not from E.
coli, but from another bacterium, Yersinia pestis. This lethal bacterium causes Bubonic
Plague (the “Black Death” made infamous by wiping out 1/3 of the population of
Europe in the 14th century). Open this ‘mutS’ hyperlink.
d. You will see a variety of information about the Y. pestis mutS gene, such as its chromosomal
location, neighboring genes, links to PubMed references, etc.
Q1: How many papers about the Y. pestis mutS gene have been published?
Q2: Does it appear that the Y. pestis genome has been completely sequenced?:
Q3: How large is the Y. pestis genome, and how many proteins can it encode?:
e. Back to the search for the E. coli mutS gene: go back to the Y. pestis mutS front page
(“mutS methyl-directed mismatch repair protein”), and scroll to the bottom. Click on the
“Protein” link, AAM84420, to obtain the amino acid (aa) sequence of the Y. pestis mutS protein.
Q4: How many aa long is this protein?:
f. Open a new window, and go to the NCBI homepage http://www.ncbi.nlm.nih.gov/
g. From the dark blue line above the search window, choose “BLAST”, and then choose
“Protein-protein BLAST (blastp)”. You will now perform your first BLAST search, using the
Basic Local Alignment Search Tool. This tool allows you to rapidly search the entirety of
GenBank to locate genes and proteins that are related to your “Query” sequence. In this
case your Query sequence will be the Y. pestis mutS protein. Let’s see if we can use it to find
the E. coli mutS protein.
h. Copy/paste the entire Y. pestis mutS protein sequence into the Search window on the BLAST
page. Don’t worry about the numbers and the extra spaces – BLAST knows how to ignore
them.
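If you would like to check the protein length yourself, or see what “ignoring the numbers and spaces” amounts to, the following is a minimal Python sketch. The function name `clean_sequence` and the toy fragment are made up for illustration; the fragment is not the Y. pestis mutS sequence.

```python
import re


def clean_sequence(raw: str) -> str:
    """Remove digits, whitespace, and any other non-letter characters,
    then return the remaining residues in upper case."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


# Toy fragment laid out like a GenBank sequence block:
# a position number followed by groups of ten residues.
raw_fragment = """
        1 maqtlkdesv gnhirwyfcp
       21 aaklmteq
"""

protein = clean_sequence(raw_fragment)
print(protein)       # the residues as one continuous string
print(len(protein))  # number of amino acids in the toy fragment
```

Pasting the real sequence in place of `raw_fragment` would give its length in amino acids.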
i. Click the blue “BLAST!” button to begin the search. A new screen will appear shortly
thereafter. On this new screen, click “FORMAT”. This will bring up a new window, and within
1-4 minutes the completed BLAST search report should appear. Be patient…get a drink of
water, get to know your neighbor…
j. Interpreting the BLAST results:
(1) The first window will contain a graphical display of “hits” showing the relative similarity
between your query sequence and genes to which it is related. Suffice it to say that if the hits
are in red color, they represent proteins that are extremely similar to Y. pestis mutS.
(2) The second element of the report contains a list of the top 100 hits, in descending order of
similarity to your query protein. Each entry is listed on a single line, with a GenBank
accession number for each homolog hyperlinked so that you can get to it.
(3) The third element of the report contains alignments of your query protein (top line) to
similar “subject” proteins (bottom line). Entries between the query line and subject line
indicate aa residues that are identical between the two proteins, and also aa residues that are
conservatively substituted between the two proteins (indicated by a ‘+’ sign). The meaning
of “conservative substitution” will be explained during today’s lab session.
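To make the middle line of an alignment less mysterious, here is a rough Python sketch that builds one for two already-aligned toy sequences. It rests on a loud assumption: similar residues are defined by the small hand-made `SIMILAR_GROUPS` list, whereas BLAST itself decides ‘+’ from a scoring matrix (BLOSUM62 by default for blastp). The sequences and all names are hypothetical.

```python
# Simplified, illustrative similarity groups (an assumption for this sketch;
# BLAST uses a substitution matrix rather than fixed groups).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST"]


def midline(query: str, subject: str) -> str:
    """Build the line shown between two aligned sequences:
    the residue letter for an identity, '+' for a similar pair,
    and a blank for a mismatch or a gap ('-')."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            out.append(" ")
        elif q == s:
            out.append(q)
        elif any(q in group and s in group for group in SIMILAR_GROUPS):
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)


# Toy aligned pair (not real mutS sequences).
query = "MKDLTAE-GR"
subject = "MRELSAEWGK"

mid = midline(query, subject)
identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
positives = identities + mid.count("+")

print("Query  ", query)
print("       ", mid)
print("Sbjct  ", subject)
print(f"Identities = {identities}/{len(query)}, Positives = {positives}/{len(query)}")
```

The identity and positive counts printed at the end correspond to the kind of summary figures that appear above each alignment in a BLAST report.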
k. Go to the third entry line in the report, and copy/paste the GenBank accession number here:
Q5: GB Accession #
l. Go to the third alignment in the report, and copy/paste it here (preserve the alignment by using
a COURIER font at size 10). Make sure that the entry includes the top line with GenBank
accession numbers and other descriptors.
Q6: Paste in the alignment here
Q7: What species does the subject sequence come from?
Q8: Are the two proteins the same length? If not, what is the length of each?
Q9: Do the two proteins appear to be related? Does the alignment report contain a
quantitative indicator of relatedness? If so, what is the measure of their relatedness?
m. Click on the annotation link to obtain the sequence of the mutS homolog from E. coli. This will
give you a page of annotation about the protein only. We would like to know about the DNA
sequence as well. This additional information can be obtained by clicking on the hyperlink
associated with the DBSOURCE.
Q10: Paste the full page of results here:
Q11: Reading the header of a prokaryotic GenBank entry (p.79, BFD). Following the outline
on p.80, record below the LOCUS, DEFINITION, ACCESSION, VERSION,
KEYWORDS, SOURCE, ORGANISM, REFERENCE, and COMMENTS for the E.
coli mutS gene:
LOCUS:
DEFINITION:
etc., etc.
Q12: p. 82, BFD – what does the term CDS mean?
Q13: At what nucleotide number does the CDS begin in this GenBank entry?
Q14: Paste in the first 300 nt of the nucleotide sequence here, and highlight (Bold) the start
codon:
n. We now know the nucleotide and protein sequences of the E. coli mutS gene, but the
annotation provided does not include the upstream regulatory sequences, i.e., the promoter for
recognition by RNA polymerase, and the Ribosome Binding Site (RBS) by which the ribosome
joins with the mRNA to begin translating the protein.
Let’s go and find these upstream regulatory elements. To do so, we will need to access a
much larger chunk of the E. coli genome, as follows:
o. To begin, follow the three steps at the bottom of p.82 and top of p. 83, BFD, to convert the
nucleotide sequence to a more universally acceptable format, called “FASTA”.
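For reference, a FASTA record is simply a header line beginning with ‘>’ followed by the bare sequence, usually wrapped to a fixed line width. The sketch below shows one way to produce that layout in Python from a numbered, space-separated sequence. It does not reproduce the BFD steps; the function name `to_fasta`, the header text, and the toy sequence are assumptions for illustration.

```python
import re


def to_fasta(header: str, raw_sequence: str, width: int = 70) -> str:
    """Return a FASTA record: a '>' header line, then the sequence
    stripped of digits and spaces and wrapped to `width` characters."""
    seq = re.sub(r"[^A-Za-z]", "", raw_sequence).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Toy nucleotide block in GenBank-style layout (not the real mutS entry).
raw = """
        1 atgagtgcaa tagaaaattt
       21 cgacgcc
"""

print(to_fasta("toy_sequence example header", raw, width=10))
```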
p. Paste the entire nucleotide sequence into a Nucleotide-Nucleotide BLAST search. To do this,
go back to your BLAST search page, paste the sequence into the Search window, and choose
“Nucleotide” from the blue menu bar, then begin the search by clicking the blue “BLAST”
button. “FORMAT” the BLAST search to retrieve the results in a new window, as
you did with the previous Protein-Protein BLAST search.
q. The results will show many independent GenBank entries containing the mutS DNA region.
This will illustrate just how redundant GenBank can be (cf. p. 84, BFD, the “Remember”
icon). The database often contains numerous entries of the same information, usually
because of independent submissions by different authors. Most of the top entries in this case
correspond to GenBank files that contain the entire E. coli genome or large chunks of the
genome.
r. In this case, open AE016765.1, a “manageable” chunk consisting of 305,000 base pairs. This
is section 11 out of 18 of the complete E. coli genome. Give it a moment to fully load... this file
contains a lot of information!
As you scroll down, you will see sequential translations of every Open Reading Frame (ORF),
i.e., every potential gene, encoded in this large segment of DNA. At the bottom of the file, all
305,000 nucleotides are listed.
To make this a bit easier, we did some groundwork for you, and discovered that the mutS
gene lies somewhere between nt 115,840 and 118,752, and the ATG codon is at nt 116,191.
Q15: Copy/paste this 2912 nt sequence below (Courier 10 pt), and highlight (Bold) the Start
codon. NOTE: keep in mind that the ATG codon is not the same thing as the +1
transcription start site.
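GenBank coordinates count from 1 and include both endpoints, while many programming languages count from 0, so it is easy to be off by one when cutting out a region. The Python sketch below shows the bookkeeping on a toy string; the helper names are hypothetical. Note that if both endpoints are counted, a span from `start` to `end` holds `end - start + 1` nucleotides.

```python
def extract_region(genome: str, start: int, end: int) -> str:
    """Return nucleotides start..end using 1-based, inclusive coordinates."""
    if not (1 <= start <= end <= len(genome)):
        raise ValueError("coordinates fall outside the sequence")
    return genome[start - 1:end]


def offset_in_region(region_start: int, position: int) -> int:
    """Convert a genomic coordinate to a 1-based position inside a
    region that begins at `region_start`."""
    return position - region_start + 1


# Toy example: the ATG sits at positions 6-8 of this made-up sequence.
toy_genome = "GGGCCATGAAATTT"
region = extract_region(toy_genome, 3, 10)
atg_at = offset_in_region(3, 6)
print(region)                        # the excised region
print(atg_at)                        # where the ATG starts inside it
print(region[atg_at - 1:atg_at + 2])  # the codon itself

# The same arithmetic applied to the coordinates given in this handout.
print(offset_in_region(115840, 116191))
```

The last line reports how far into the excised mutS region you should expect to find the ATG codon.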
Q16: Locate the –35 promoter sequence. Highlight it (Bold) in the sequence you pasted in
above, and list its sequence here. NOTE: keep in mind that the –35 region may not
perfectly match the consensus sequence, but it will probably differ by no more than one
base from the consensus. For help, refer to the class lecture slides (Sept6_8).
Q17: Locate the –10 promoter sequence, keeping in mind that it also may not follow the exact
consensus. Highlight it (Bold) in the sequence above, and list it here:
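Scanning a few hundred nucleotides by eye for a near-match is error-prone, so a small search script can help you shortlist candidates. The Python sketch below assumes the widely cited sigma-70 consensus sequences TTGACA (–35) and TATAAT (–10); confirm these against the lecture slides. The function name, the mismatch allowance, and the toy sequence are all illustrative, and a hit is only a candidate, not proof of a promoter.

```python
def find_near_matches(sequence: str, consensus: str, max_mismatches: int = 1):
    """Return (1-based position, matched window, mismatch count) for every
    window that differs from `consensus` at no more than `max_mismatches`
    positions."""
    sequence = sequence.upper()
    k = len(consensus)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((i + 1, window, mismatches))
    return hits


# Toy sequence (not from E. coli): a -35-like box with one mismatch,
# a 17-nt spacer, then a perfect -10 box.
toy = "CC" + "TTGACT" + "GCGCGCGCGCGCGCGCG" + "TATAAT" + "GCCA"

minus35 = find_near_matches(toy, "TTGACA")
minus10 = find_near_matches(toy, "TATAAT")
print(minus35)
print(minus10)

# Spacer length between the end of the -35 hit and the start of the -10 hit.
spacer = minus10[0][0] - (minus35[0][0] + len("TTGACA"))
print(spacer)
```

When applying this to the real sequence, restrict the search to the region upstream of the start codon, and prefer pairs of hits separated by a plausible spacer.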
Q18: Propose a likely start site for transcription (+1), and highlight (Bold) its location above. At
what nucleotide (type of base and nt #) does transcription probably start?
Q19: Locate the Ribosome Binding Site (RBS), which has the consensus sequence
AGGAGGU in the mRNA. Highlight it (Bold). In what region of the mRNA transcript is the
RBS found? How far is the RBS from the start codon?
Q20: As oriented above, does this DNA sequence represent the coding strand or the template
strand? Is the DNA sequence as shown oriented from the 5’ end to the 3’ end, or from the
3’ end to the 5’ end, of the E. coli mutS gene? If necessary, refer to your class lecture
notes (Sept10).
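When thinking about strands, it can help to see the relationships written out. The mRNA has the same sequence as the coding strand with U in place of T, and the template strand is the reverse complement of the coding strand. The Python sketch below illustrates this on a made-up fragment; the function names and the toy sequence are assumptions, not part of the mutS entry.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcribe_coding_strand(coding: str) -> str:
    """Return the mRNA that matches a coding strand (T replaced by U)."""
    return coding.upper().replace("T", "U")


# Toy coding-strand fragment: an RBS-like motif a few bases before an ATG.
coding = "AGGAGGTAACATATGAGT"

template = reverse_complement(coding)
mrna = transcribe_coding_strand(coding)

print("coding   5'-" + coding + "-3'")
print("template 5'-" + template + "-3'")
print("mRNA     5'-" + mrna + "-3'")

# The mRNA consensus AGGAGGU therefore appears as AGGAGGT on the coding
# strand, and as its reverse complement on the template strand.
print(reverse_complement("AGGAGGT"))
```

If a motif search on your pasted sequence finds nothing, running the same search on the reverse complement is a quick way to test whether you are looking at the other strand.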