Dabbling in bioinformatics:

CS 251 Introduction to Bioinformatics
Laboratory 2: Dabbling in Bioinformatics

Today, we will take our first real crack at using bioinformatic tools. We will follow the flow of Chapter 3 of Bioinformatics for Dummies (BFD), in which the authors use a single gene (dUTPase, deoxyuridine 5' triphosphate nucleotidylhydrolase) as a vehicle for touring several genome databases and for learning some basic terminology and search tools. To make this exercise more interesting for you, we will substitute a gene of our choosing, the mutS/hMSH2 DNA repair gene. We will also ask you to perform some additional steps (e.g., blastp) and answer a variety of questions as you navigate this "road rally" through databases, genomes, and search tools.

First, the essential gene terminology:

mutS is the name given to the prokaryotic (bacterial) version of this universal defender of the genome. ("mut" is an abbreviation that reflects the increased rate at which DNA mutations accumulate in cells that lack this critical gene.)

MSH2 is the name given to the eukaryotic (algae and fungi, plants, and animals) version of this gene. ("MSH" is an abbreviation for "MutS Homolog".) The term "homolog" means that the MSH2 gene looks and acts like the mutS gene, i.e., its structure (DNA and protein sequence) is similar to that of mutS, and it plays a similar role in preventing mutations from occurring.

hMSH2: the prefix "h" in front of a gene name indicates that it is the human version of the gene.

For some background, please obtain the PubMed abstracts of these two recent research articles about the mutS/hMSH2 genes.

Ainsworth P, Koscinski D, Fraser B, Stuart J. Family cancer histories predictive of a high risk of hereditary non-polyposis colorectal cancer associate significantly with a genomic rearrangement in hMSH2 or hMLH1. Clin Genet. 2004 Sep;66(3):183-188. PMID: 15324316 [PubMed - as supplied by publisher]

Watson ME Jr, Burns JL, Smith AL.
Hypermutable Haemophilus influenzae with mutations in mutS are found in cystic fibrosis sputum. Microbiology. 2004 Sep;150(Pt 9):2947-58. PMID: 15347753 [PubMed - in process]

Please answer the following questions here:

From the abstract by Ainsworth P, Koscinski D, Fraser B, Stuart J.: HNPCC is a hereditary form of colon cancer caused by defects in DNA repair genes, most notably the hMSH2 gene. About 1 in 200 of us will develop this cancer because we carry a defective copy of the hMSH2 gene. Does this paper describe any bioinformatic tools for predicting risk for this defect in human populations? What is the name of this tool, and where is it located? At what institution was this tool developed and housed?

From the abstract by Watson ME Jr, Burns JL, Smith AL: Normally, bacteria lacking the mutS gene are at a distinct disadvantage owing to the rapid accumulation of deleterious mutations in their DNA. Why might this defect in DNA repair provide an advantage for human bacterial pathogens?

Procedure: follow pp. 78-84 in BFD.

Objective: Locate and study the E. coli mutS gene.

Go to the GenBank entry tool at http://www.ncbi.nlm.nih.gov/entrez/

a. From the "Search" pull-down menu, choose "Gene".

b. Type the term "mutS E. coli" in the "For" window and click "Go".

c. Entries for a number of human versions of this gene are listed. However, nowhere on this list will you find the E. coli mutS gene (strangely?!). Instead, scroll down the page until you find the 14th entry. This will provide you with annotation for the mutS protein not from E. coli, but from another bacterium, Yersinia pestis. This lethal bacterium causes bubonic plague (the "Black Death", made infamous by wiping out 1/3 of the population of Europe in the 14th century). Open this "mutS" hyperlink.

d. You will see a variety of information about the Y. pestis mutS gene, such as its chromosomal location, neighboring genes, links to PubMed references, etc.

Q1: How many papers about the Y.
pestis mutS gene have been published?

Q2: Does it appear that the Y. pestis genome has been completely sequenced?

Q3: How large is the Y. pestis genome, and how many proteins can it encode?

e. Back to the search for the E. coli mutS gene: return to the Y. pestis mutS front page ("mutS methyl-directed mismatch repair protein") and scroll to the bottom. Click on the "Protein" link, AAM84420, to obtain the amino acid (aa) sequence of the Y. pestis mutS protein.

Q4: How many aa long is this protein?

f. Open a new window, and go to the NCBI homepage: http://www.ncbi.nlm.nih.gov/

g. From the dark blue line above the search window, choose "BLAST", and then choose "Protein-protein BLAST (blastp)". You will now perform your first BLAST search, using the Basic Local Alignment Search Tool. This tool allows you to rapidly search the entirety of GenBank to locate genes and proteins that are related to your "Query" sequence. In this case, your Query sequence will be the Y. pestis mutS protein. Let's see if we can use it to find the E. coli mutS protein.

h. Copy/paste the entire Y. pestis mutS protein sequence into the Search window on the BLAST page. Don't worry about the numbers and the extra spaces: BLAST knows how to ignore them.

i. Click the blue "BLAST!" button to begin the search. A new screen will appear shortly thereafter. On this new screen, click "FORMAT". This will bring up a new window, and within 1-4 minutes the completed BLAST search report should appear. Be patient: get a drink of water, get to know your neighbor.

j. Interpreting the BLAST results:

(1) The first element of the report contains a graphical display of "hits" showing the relative similarity between your query sequence and the genes to which it is related. Suffice it to say that hits shown in red represent proteins that are extremely similar to Y. pestis mutS.

(2) The second element of the report contains a list of the top 100 hits, in descending order of similarity to your query protein.
Each entry is listed on a single line, with the GenBank accession number for each homolog hyperlinked so that you can open it.

(3) The third element of the report contains alignments of your query protein (top line) to similar "subject" proteins (bottom line). Entries between the query line and the subject line indicate aa residues that are identical between the two proteins, and also aa residues that are conservatively substituted between the two proteins (indicated by a "+" sign). The meaning of "conservative substitution" will be explained during today's lab session.

k. Go to the third entry line in the report, and copy/paste the GenBank accession number here:

Q5: GB Accession #

l. Go to the third alignment in the report, and copy/paste it here (preserve the alignment by using a Courier font at size 10). Make sure that the entry includes the top line with GenBank accession numbers and other descriptors.

Q6: Paste in the alignment here:

Q7: What species does the subject sequence come from?

Q8: Are the two proteins the same length? If not, what is the length of each?

Q9: Do the two proteins appear to be related? Does the alignment report contain a quantitative indicator of relatedness? If so, what is the measure of their relatedness?

m. Click on the annotation link to obtain the sequence of the mutS homolog from E. coli. This will give you a page of annotation about the protein only. We would like to know about the DNA sequence as well. This additional information can be obtained by clicking on the hyperlink associated with the DBSOURCE.

Q10: Paste the full page of results here:

Q11: Reading the header of a prokaryotic GenBank entry (p. 79, BFD). Following the outline on p. 80, record below the LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, ORGANISM, REFERENCE, and COMMENT fields for the E. coli mutS gene:
LOCUS:
DEFINITION:
etc., etc.

Q12: p. 82, BFD: what does the term CDS mean?

Q13: At what nucleotide number does the CDS begin in this GenBank entry?
Q14: Paste in the first 300 nt of the nucleotide sequence here, and highlight (bold) the start codon:

n. We now know the nucleotide and protein sequences of the E. coli mutS gene, but the annotation provided does not include the upstream regulatory sequences, i.e., the promoter recognized by RNA polymerase, and the Ribosome Binding Site (RBS) at which the ribosome joins the mRNA to begin translating the protein. Let's go and find these upstream regulatory elements. To do so, we will need to access a much larger chunk of the E. coli genome, as follows:

o. To begin, follow the three steps at the bottom of p. 82 and top of p. 83, BFD, to convert the nucleotide sequence to a more universally accepted format, called "FASTA".

p. Paste the entire nucleotide sequence into a Nucleotide-Nucleotide BLAST search. To do this, go back to your BLAST search page, paste the sequence into the Search window, and choose "Nucleotide" from the blue menu bar, then begin the search by clicking the blue "BLAST" button. "FORMAT" the BLAST search to retrieve the results in a new window, as you did with the previous Protein-Protein BLAST search.

q. The results will show many independent GenBank entries containing the mutS DNA region. This illustrates just how redundant GenBank can be (cf. p. 84, BFD, the "Remember" icon). The database often contains numerous entries with the same information, usually because of independent submissions by different authors. Most of the top entries in this case correspond to GenBank files that contain the entire E. coli genome or large chunks of the genome.

r. In this case, open AE016765.1, a "manageable" chunk consisting of 305,000 base pairs. This is section 11 out of 18 of the complete E. coli genome. Give it a moment to fully load; this file contains a lot of information! As you scroll down, you will see sequential translations of every Open Reading Frame (ORF), i.e., every potential gene, encoded in this large segment of DNA.
At the bottom of the file, all 305,000 nucleotides are listed. To make this a bit easier, we did some groundwork for you and found that the mutS gene lies somewhere between nt 115,840 and 118,752, and that the ATG codon is at nt 116,191.

Q15: Copy/paste this 2912 nt sequence below (Courier 10 pt), and highlight (bold) the start codon. NOTE: keep in mind that the ATG codon is not the same thing as the +1 transcription start site.

Q16: Locate the -35 promoter sequence. Highlight it (bold) in the sequence you pasted in above, and list its sequence here. NOTE: keep in mind that the -35 region may not perfectly match the consensus sequence, but it will probably differ from the consensus by no more than one base. For help, refer to the class lecture slides (Sept6_8).

Q17: Locate the -10 promoter sequence, keeping in mind that it also may not follow the exact consensus. Highlight it (bold) in the sequence above, and list it here:

Q18: Propose a likely start site for transcription (+1), and highlight (bold) its location above. At what nucleotide (type of base and nt #) does transcription probably start?

Q19: Locate the Ribosome Binding Site (RBS), which has the consensus sequence AGGAGGU in the mRNA. Highlight it (bold). In what region of the mRNA transcript is the RBS found? How far is the RBS from the start codon?

Q20: As oriented above, does this DNA sequence represent the coding strand or the template strand? Is the DNA sequence as shown oriented from the 5' end to the 3' end, or from the 3' end to the 5' end, of the E. coli mutS gene? If necessary, refer to your class lecture notes (Sept10).
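
If you would like to double-check by computer the promoter and RBS candidates you found by eye, the search can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the BFD procedure. It assumes the textbook sigma-70 consensus hexamers TTGACA (-35) and TATAAT (-10), which are not given in this handout, and it writes the RBS consensus as DNA (AGGAGGT). The function names and the 60-nt toy sequence are invented for the example; the toy sequence is not the real mutS region.

```python
import re

# Textbook sigma-70 consensus hexamers: an assumption, not stated in the handout.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"
# Coding-strand DNA form of the mRNA consensus AGGAGGU (U written as T).
RBS = "AGGAGGT"


def clean_sequence(raw):
    """Drop position numbers, spaces and line breaks from a GenBank-style sequence block."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


def slice_1based(seq, start, end):
    """Return nucleotides start..end, counted from 1 with both ends included."""
    return seq[start - 1:end]


def find_motif(seq, motif, max_mismatch=1):
    """List (1-based position, matched text, mismatch count) for every near match."""
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        diff = sum(a != b for a, b in zip(window, motif))
        if diff <= max_mismatch:
            hits.append((i + 1, window, diff))
    return hits


# Invented 60-nt toy sequence (NOT the real mutS region), laid out like GenBank output.
TOY_RAW = """
        1 ccgttgacac gccgcaagcc cgcacggtat aatccagcta ggaggtcaac catggcaagc
"""
TOY_ATG = 52  # 1-based position of the start codon in the toy sequence

toy = clean_sequence(TOY_RAW)
upstream = slice_1based(toy, 1, TOY_ATG - 1)  # everything before the ATG

# Report every candidate that differs from a consensus by at most one base.
for name, motif in (("-35", MINUS_35), ("-10", MINUS_10), ("RBS", RBS)):
    for pos, text, diff in find_motif(upstream, motif):
        print(name, pos, text, "mismatches:", diff)
```

To try this on the real data, you would paste the sequence from AE016765.1 in place of the toy sequence and replace TOY_ATG with the position of the ATG within your pasted region: 116,191 - 115,840 + 1 = 352. Note that counting both end points, nt 115,840 through 118,752 is 2,913 nt, while the simple difference of the two coordinates is 2,912. A one-mismatch search will usually report several candidates, so you still need to judge which ones sit at a sensible distance from each other and from the start codon.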
