MCB 320H March 28

Bioinformatics is the science of managing and analyzing biological data, particularly DNA sequence data, using advanced computing programs. In this exercise, our goal is to learn about the function and structure of a protein of interest by analyzing a DNA sequence.

First, we will take a DNA sequence and determine its protein-coding capability. This is done by entering the DNA sequence into the ORF (open reading frame) finder to locate a stretch of the sequenced DNA that begins with an initiation codon (ATG, methionine) and ends with a nonsense (stop) codon. These ORFs have the potential to encode our proteins of interest. Next, we will compare our sequence to sequences found in a universal database provided by the National Center for Biotechnology Information (NCBI) and interpret the output from this program. Along the way, you will see how sequence analysis is linked to the biological sciences literature database (PubMed). The NCBI sequence database links to PubMed, where you can find data on the specific function of a protein. PubMed holds numerous citations and abstracts from the published literature that are referenced by genetic sequence records.

Let's say you have discovered a strain of mice that has a very high rate of skin tumors. You first try to perform genetic analysis on this strain, but the phenotype does not appear to be due to a single gene. Next, you try QTL analysis but do not find any intervals with significant LOD scores. Eureka, let's try a molecular biology approach and isolate mRNAs from the skin cells. You clone several thousand cDNAs and sequence a bunch of them.
Many seem to belong to a single gene based on overlapping sequences, which you assemble into the following sequence: atcgctgtcg cagggccccg aagacacctg ctgaagggaa tgacagatca atggcatccg gcattggtga actgcactgc tcacgcgcac taacaggctt agaacctaga ttggcctgaa tgatcatttc tcgggacacc ccgtgaacca gggactgtgt tcctggaggg aatgtctgcc agtgtgccca gagagaacaa acgccaactg ctgggccaaa tggtggccct tacgccgcct caaaccaagc gttcgggagc aaatcccggt tccttgacga gcatctgtct tggcaggtcc agagagtgac cccaccactc gtacagcttt tggctcatgt caagtgtaaa atttaaagac catcagcggg tcctcctcta tttgctgatt aataatacgt catcacatca tggaaaccga caatcagaaa cgtctgcaat ctcctgccag ggaaccaagg ccaggccatg ctacattgat cactctggtc tacctatgga gataccatct tgggattggc gcttcaagag ccacttgagg atttggcaca ggccatcaag agcctatgtg gacctccact cccagtgact tgtctggtct atgctgtaca ggtgccacct gtccgagcct aaatgtgatg acactctcca gaccttcaca gacccacgag caggcttggc ggcagaacaa ctggggctgc aatttgtgct accaaaatca cctttatgct aatgtgagca gagtttgtgg aacatcacct ggcccacact tggaagtatg tgtgctgggc attgccactg ctattcatgc agagagctcg atattaaagg gtgtataagg gagttaagag atggctagtg gtccagctca gctgccacaa gccaaaagtt accccaccac gtgtgaagaa gtgggcctga ggccctgtcg taaatgctac tcctgccagt aactagaaat ctgataactg agcaacatgg gttccctcaa acgcaaacac tgaacaacag cctcggaagg gaggcaggga aaaattctga gtacaggcag gtgtcaagac cagatgccaa caggtcttca ggattgtggg gaagacgtca tggaacctct aaacagaatt gtctctggat aagccacatc tggacaaccc ttacacagct ccaatgtgct ccaagatgag ctatcagatg gtgcccccga ctactacgaa caaagtttgt aaacatcaaa ggcctttaag tctaaaaacc gactgacctc tcagttttct ggagatcagt aataaactgg agctgagaaa ctgctggggc gtgcgtggag atgcatccag gggaccagac ctgcccagct taatgtctgc aggatgtgaa tggcctcctc cattgttcga cacacccagc caaaaagatc cccagaaggt tccaaaagcc tcatgtatgc catgccctac gcggggtgta gccacatgca gatgtcaacc aactacgtgg gtggaagaag aatggcatag cacttcaaat ggggattctt gtaaaggaaa catgctttcg ttggcggtcg gatggggatg aaaaaactct gactgcaagg cctgagccca aaatgcaaca tgccatccag aactgcatcc ggcatcatgg cacctatgcc gtgtggccat ttcatagtgg aagcgtacac ggagaagctc aaagttctgg gagaaagtaa 
aacaaagaaa cgcctcctgg ggttgcctcc tggactacgt tgcagattgc cagccaggaa ccaaactgct agtggatggc gctatggtgt cagcaagtga gcaccatcga caaagttccg ttgttatcca gagccctgat cacagcaagg gtgcaactag aagaagacgc acatagatga cagcaggctc gagacctgca ctgcccagcc gcagtcacca ccaagccaaa cacctccaag atgtctggac atgcccacgc ccgagaacac aaagggcatg tgtactggtg tggtgctgaa tttggaatca cactgtgtgg catctcatcc tgtctacatg agagttgatt gggggatgaa ggatgaagag cttcttcaac caacaattcc cttcttgcag cgcattcctc tgtgcagaac ttatcaaaat tacctgtctc aatgagccta tggcatattt cagtgagttt tttctagaat tgtgtcaaat aaggacaaca aactacctgg aagacaccac gagaaagaat attttacacc gaactgatga atcctagaga atcatggtca cttgaattct agaatgcatt gacatggagg agcccgtcca actgtggctt cggtacagct cctgtacctg cctgtctatc ccccacagca agtagtgggt gacaaccctg aagggcccca attggagcat cccaggacca gtcactcaga ttggctccca aagatcggcg agcatgtcaa atcatgccga gaatttatac cctttgggtc aaggagagcg agtgctggat ccaaaatggc tgccaagccc atgtagttga cgtcgaggac gcattaatag ccgaccccac aatatgtaaa acaatcagcc atgcagtggg ttaacagccc actaccagca cagctgaaaa gacaagaagg actatggcag ctggctttaa gtacctcctc tttggtgcac gatcacagat ggggggcaaa acaccaaagt caagccttat ccttccacag gatagatgct ccgagaccca tacagactcc tgctgatgag tcccctcttg aaatgggagc aggtgctgta ccaatctgtt cctgcatcca caaccctgag tgcactctgg ggacttcttc tgcagagtac ggcatcatac cacctccact agcataactc aactggtgtg cgtgacttgg tttgggctgg gtgcctatca gatgtctgga gatggaatcc ccacctatct gatagccgcc cagcgctacc aacttttacc tatcttatcc agttctctga tgccgtgtca acagaggaca cccaagaggc gctcctggaa tatctcaaca atccagaaag cccaaggaaa ctacgggtgg cagctataaa tctggtagcc tgatgggctt

How do you figure out what gene is encoded by this DNA sequence? Copy the sequence; don't worry, the spaces won't matter.

First, let's see if there is an ORF. An ORF is a section of a sequenced piece of DNA or cDNA that begins with an initiation codon (ATG, methionine) and ends with a nonsense codon (TAA, TAG, or TGA). The ORF is defined by the placement of start and stop codons.
These are the sites on the mRNA where translation starts and stops. The region between the start and stop codons determines the protein sequence.

Now go to the ORF finder webpage: http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Copy the sequence into the box and click on the OrfFind button. Which reading frame has the longest ORF? How long is the predicted protein? What calculation do you do to figure this out?

Now click on the green bar representing the longest ORF. You now get to a page where you can run a BLAST search.

BLAST (Basic Local Alignment Search Tool) compares a query sequence with the nucleotide and protein sequences in a database. BLAST can be used to analyze functional and evolutionary relationships between sequences as well as to identify gene families. There are five different types of BLAST programs:
a. blastp compares an amino acid query sequence against a protein sequence database.
b. blastn compares a nucleotide query sequence against a nucleotide sequence database.
c. blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
d. tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
e. tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

This ORF BLAST page is specialized for running blastp or tblastn searches. Go ahead and click the BLAST button, which will do a blastp search against nr (the nonredundant database, in which each sequence is represented only once). A new page will open; to see the data, click on “View Report”. There is a reference at the top of the page, followed by the number of sequences in the database, each of which was compared to the protein you entered.
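Both the ORF finder step and the six-frame translations used by blastx, tblastn, and tblastx rest on the idea of reading frames. The following is a minimal Python sketch of a six-frame ORF scan, meant only to make the idea concrete. It is not the algorithm NCBI uses, and the function names (`clean`, `orfs_in_frame`, `longest_orf`) and the frame numbering (+1 to +3 for the given strand, -1 to -3 for its reverse complement) are assumptions made for illustration. It only reports ORFs that start with ATG and end with an in-frame stop codon.

```python
# Minimal six-frame ORF scan; an illustrative sketch, not NCBI's ORF finder.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(raw_seq):
    """Uppercase the sequence and keep only A, C, G, T (drops spaces, digits)."""
    return "".join(ch for ch in raw_seq.upper() if ch in "ACGT")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """Yield (start, end) for each ATG...stop ORF in one frame.

    Coordinates are 0-based on the strand being scanned;
    end is exclusive and includes the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def longest_orf(raw_seq):
    """Return (frame, start, end, n_amino_acids) for the longest ORF, or None."""
    seq = clean(raw_seq)
    best = None
    for strand, strand_seq in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand_seq, offset):
                # Each codon is 3 nucleotides; the stop codon adds no amino acid.
                n_aa = (end - start) // 3 - 1
                if best is None or n_aa > best[3]:
                    best = (strand * (offset + 1), start, end, n_aa)
    return best
```

For the toy input `cc atg aaa ccc ggg taa tt`, `longest_orf` returns `(3, 2, 17, 4)`: the ORF sits in frame +3, spans 15 nucleotides including the stop codon, and so encodes 15 / 3 - 1 = 4 amino acids. The web tool may number frames or coordinates differently, so compare lengths rather than labels.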
Scroll down and you will see a color-coded graphic of sequences similar to yours. Red means a “hot” or very similar sequence. The length of each bar indicates the extent of the similarity from amino terminus to carboxy terminus. Scroll down further and you will see the individual sequences that are the best matches. Below that are the alignments.

Click on: ref|NP_997538.1|

You get the protein entry from GenBank, with a few references on the gene encoded by your DNA sequence. What is the gene and what does it do?

Scroll down and look at the first alignment. How many of the amino acids are identical in the first alignment? There is a second alignment in which the computer is taking a repeat sequence and trying to align it to a less-conserved sequence. You can ignore that, but note that this can be a problem with proteins that contain repeated sequences or repeated domains.

What species is this sequence from? Scroll down until you find three more species. What are they (common name) and how are they related to mice?

Now go back up to the top of the page and click on Show Conserved Domains. This is a graphical representation of the domains found in this protein. What are the domains? What do you think each of these does? Click around a little and see what more you can find out.

This is all you have to do. Other useful sites include the NCBI BLAST site: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi where you can do all five kinds of BLAST searches, and the ExPASy site: http://us.expasy.org/ if you want to look more at protein domains in sequences. There are also lots of nice tools under the Structure button on the NCBI page.
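If you want to double-check the identity count you read off the first alignment, the arithmetic is simple enough to sketch in Python. The function below is a hypothetical helper, not part of BLAST. It assumes you have copied the two aligned rows (query and subject) as equal-length strings with `-` marking gaps, and that percent identity is computed over the full alignment length, gaps included.

```python
def identity(aligned_query, aligned_subject, gap="-"):
    """Return (matches, alignment_length, percent_identity) for two aligned rows."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    # A position counts as identical only if both residues match and it is not a gap.
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != gap
    )
    length = len(aligned_query)
    return matches, length, 100.0 * matches / length
```

For example, `identity("MKT-LLA", "MKSALLA")` finds 5 identical positions in an alignment of length 7. The two short strings here are made up for illustration and are not taken from the BLAST report.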