Bioinformatics - Brief - University of Arizona

MCB 320H
March 28
Bioinformatics is the science of managing and analyzing biological data, particularly
DNA sequence data, using advanced computing programs. In this exercise, our goal is to
analyze the function and structure of a protein of interest by analyzing a sequence of
DNA. First, we will take a DNA sequence and determine its protein coding capability.
This is done by entering a DNA sequence into the ORF (open reading frame) finder
to find a subset of the sequenced piece of DNA that begins with an initiation
codon (ATG, which encodes methionine) and ends with a nonsense codon. These ORFs have the
potential to encode our proteins of interest. Next, we will compare our sequence to
sequences found in a universal database provided by the National Center for
Biotechnology Information (NCBI) and interpret the output from this program. Along the
way, you will see how analyzing sequences is incorporated into the biological sciences
literature databases (PubMed). The NCBI sequence database can be linked to PubMed to
find data concerning the specific function of a protein. PubMed has numerous citations
and abstracts from published literature referenced by genetic sequence records.
Let’s say you have discovered a strain of mice that has a very high rate of skin tumors.
You first try to perform genetic analysis on this strain, but the phenotype does not appear
to be due to a single gene. Next, you try QTL analysis but do not find any intervals with
significant LOD scores. Eureka, let’s try a molecular biology approach and isolate
mRNAs from the skin cells. You clone several thousand cDNAs and sequence a bunch.
Many seem to belong to a single gene based on overlapping sequences, which you
assemble into the following sequence:
atcgctgtcg
cagggccccg
aagacacctg
ctgaagggaa
tgacagatca
atggcatccg
gcattggtga
actgcactgc
tcacgcgcac
taacaggctt
agaacctaga
ttggcctgaa
tgatcatttc
tcgggacacc
ccgtgaacca
gggactgtgt
tcctggaggg
aatgtctgcc
agtgtgccca
gagagaacaa
acgccaactg
ctgggccaaa
tggtggccct
tacgccgcct
caaaccaagc
gttcgggagc
aaatcccggt
tccttgacga
gcatctgtct
tggcaggtcc
agagagtgac
cccaccactc
gtacagcttt
tggctcatgt
caagtgtaaa
atttaaagac
catcagcggg
tcctcctcta
tttgctgatt
aataatacgt
catcacatca
tggaaaccga
caatcagaaa
cgtctgcaat
ctcctgccag
ggaaccaagg
ccaggccatg
ctacattgat
cactctggtc
tacctatgga
gataccatct
tgggattggc
gcttcaagag
ccacttgagg
atttggcaca
ggccatcaag
agcctatgtg
gacctccact
cccagtgact
tgtctggtct
atgctgtaca
ggtgccacct
gtccgagcct
aaatgtgatg
acactctcca
gaccttcaca
gacccacgag
caggcttggc
ggcagaacaa
ctggggctgc
aatttgtgct
accaaaatca
cctttatgct
aatgtgagca
gagtttgtgg
aacatcacct
ggcccacact
tggaagtatg
tgtgctgggc
attgccactg
ctattcatgc
agagagctcg
atattaaagg
gtgtataagg
gagttaagag
atggctagtg
gtccagctca
gctgccacaa
gccaaaagtt
accccaccac
gtgtgaagaa
gtgggcctga
ggccctgtcg
taaatgctac
tcctgccagt
aactagaaat
ctgataactg
agcaacatgg
gttccctcaa
acgcaaacac
tgaacaacag
cctcggaagg
gaggcaggga
aaaattctga
gtacaggcag
gtgtcaagac
cagatgccaa
caggtcttca
ggattgtggg
gaagacgtca
tggaacctct
aaacagaatt
gtctctggat
aagccacatc
tggacaaccc
ttacacagct
ccaatgtgct
ccaagatgag
ctatcagatg
gtgcccccga
ctactacgaa
caaagtttgt
aaacatcaaa
ggcctttaag
tctaaaaacc
gactgacctc
tcagttttct
ggagatcagt
aataaactgg
agctgagaaa
ctgctggggc
gtgcgtggag
atgcatccag
gggaccagac
ctgcccagct
taatgtctgc
aggatgtgaa
tggcctcctc
cattgttcga
cacacccagc
caaaaagatc
cccagaaggt
tccaaaagcc
tcatgtatgc
catgccctac
gcggggtgta
gccacatgca
gatgtcaacc
aactacgtgg
gtggaagaag
aatggcatag
cacttcaaat
ggggattctt
gtaaaggaaa
catgctttcg
ttggcggtcg
gatggggatg
aaaaaactct
gactgcaagg
cctgagccca
aaatgcaaca
tgccatccag
aactgcatcc
ggcatcatgg
cacctatgcc
gtgtggccat
ttcatagtgg
aagcgtacac
ggagaagctc
aaagttctgg
gagaaagtaa
aacaaagaaa
cgcctcctgg
ggttgcctcc
tggactacgt
tgcagattgc
cagccaggaa
ccaaactgct
agtggatggc
gctatggtgt
cagcaagtga
gcaccatcga
caaagttccg
ttgttatcca
gagccctgat
cacagcaagg
gtgcaactag
aagaagacgc
acatagatga
cagcaggctc
gagacctgca
ctgcccagcc
gcagtcacca
ccaagccaaa
cacctccaag
atgtctggac
atgcccacgc
ccgagaacac
aaagggcatg
tgtactggtg
tggtgctgaa
tttggaatca
cactgtgtgg
catctcatcc
tgtctacatg
agagttgatt
gggggatgaa
ggatgaagag
cttcttcaac
caacaattcc
cttcttgcag
cgcattcctc
tgtgcagaac
ttatcaaaat
tacctgtctc
aatgagccta
tggcatattt
cagtgagttt
tttctagaat
tgtgtcaaat
aaggacaaca
aactacctgg
aagacaccac
gagaaagaat
attttacacc
gaactgatga
atcctagaga
atcatggtca
cttgaattct
agaatgcatt
gacatggagg
agcccgtcca
actgtggctt
cggtacagct
cctgtacctg
cctgtctatc
ccccacagca
agtagtgggt
gacaaccctg
aagggcccca
attggagcat
cccaggacca
gtcactcaga
ttggctccca
aagatcggcg
agcatgtcaa
atcatgccga
gaatttatac
cctttgggtc
aaggagagcg
agtgctggat
ccaaaatggc
tgccaagccc
atgtagttga
cgtcgaggac
gcattaatag
ccgaccccac
aatatgtaaa
acaatcagcc
atgcagtggg
ttaacagccc
actaccagca
cagctgaaaa
gacaagaagg
actatggcag
ctggctttaa
gtacctcctc
tttggtgcac
gatcacagat
ggggggcaaa
acaccaaagt
caagccttat
ccttccacag
gatagatgct
ccgagaccca
tacagactcc
tgctgatgag
tcccctcttg
aaatgggagc
aggtgctgta
ccaatctgtt
cctgcatcca
caaccctgag
tgcactctgg
ggacttcttc
tgcagagtac
ggcatcatac
cacctccact
agcataactc
aactggtgtg
cgtgacttgg
tttgggctgg
gtgcctatca
gatgtctgga
gatggaatcc
ccacctatct
gatagccgcc
cagcgctacc
aacttttacc
tatcttatcc
agttctctga
tgccgtgtca
acagaggaca
cccaagaggc
gctcctggaa
tatctcaaca
atccagaaag
cccaaggaaa
ctacgggtgg
cagctataaa
tctggtagcc
tgatgggctt
How do you figure out what gene is encoded by this DNA sequence? Copy the
sequence; don’t worry, the spaces won’t matter.
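If you would rather tidy the sequence yourself before pasting it anywhere, a minimal Python sketch might look like the one below. The function name clean_sequence is made up for this illustration, and the three lines of sequence are just the first three blocks from above.

```python
def clean_sequence(raw: str) -> str:
    """Drop spaces, line breaks and digits from a pasted sequence, then upper-case it."""
    return "".join(ch for ch in raw if ch.isalpha()).upper()


# First three 10-base blocks of the assembled sequence, pasted with line breaks.
pasted = """atcgctgtcg
cagggccccg
aagacacctg"""

print(clean_sequence(pasted))
```

Anything that is not a letter is thrown away, so position numbers copied from a sequence record would be removed as well.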
First, let’s see if there is an ORF. An ORF is a section of a sequenced piece of DNA or
cDNA that begins with an initiation codon (ATG, which encodes methionine) and ends with a nonsense
codon (TAA, TAG, or TGA). The ORF is defined by the placement of start and stop
codons. These are the sites on the mRNA where translation starts and stops. The region
between the start and stop codons determines the protein sequence.
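To make the definition concrete, here is a rough Python sketch that scans all six reading frames for stretches that start with ATG and end at the first in-frame stop codon. It only illustrates the idea; it is not how the NCBI ORF finder is implemented, and the function names and the short test sequence are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ATG...stop ORF in one frame (0, 1 or 2).

    Positions are 0-based; end is exclusive and includes the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first ATG opens an ORF
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3             # first in-frame stop closes it
            start = None


def longest_orf(seq):
    """Return (strand, frame 1-3, start, end) of the longest ORF, or None."""
    seq = seq.upper()
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if best is None or end - start > best[3] - best[2]:
                    best = (strand, frame + 1, start, end)
    return best


# Invented 16-base example: ATG AAA TTT TAG sits in frame 3 of the + strand.
print(longest_orf("CCATGAAATTTTAGGG"))
```

Coordinates on the minus strand are counted along the reverse complement, not along the sequence as you pasted it.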
Now go to the ORF finder webpage: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Copy the sequence into the box and click on the OrfFind button. Which reading frame
has the longest ORF? How long is the predicted protein? What calculation do you do
to figure this out?
Now click on the green bar representing the longest ORF. You now get to a page where
you can run a BLAST search.
1. BLAST (Basic Local Alignment Search Tool) will do a comparison of your DNA
sequence with nucleotide and protein sequences. BLAST can be used to analyze
functional and evolutionary relationships between sequences as well as identify
gene families. There are five different types of BLAST programs.
a. blastp, compares an amino acid query sequence against a protein sequence
database.
b. blastn, compares a nucleotide query sequence against a nucleotide sequence
database.
c. blastx, compares the six-frame conceptual translation products of a
nucleotide query sequence (both strands) against a protein sequence
database.
d. tblastn, compares a protein query sequence against a nucleotide sequence
database dynamically translated in all six reading frames (both strands).
e. tblastx, compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database.
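The "six-frame conceptual translation" used by blastx, tblastn, and tblastx can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the standard genetic code; the function names and the nine-base test sequence are invented here, and BLAST's own implementation is not shown.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq, frame=0):
    """Translate one frame; '*' marks a stop codon, 'X' an unrecognized codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to protein strings."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for frame in range(3):
        frames[f"+{frame + 1}"] = translate(seq, frame)
        frames[f"-{frame + 1}"] = translate(reverse, frame)
    return frames


# Invented example: ATG AAA TAG reads as M, K, stop in frame +1.
for name, protein in six_frame_translations("ATGAAATAG").items():
    print(name, protein)
```

Three frames come from the sequence as given and three from its reverse complement, which is why the descriptions above say "both strands".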
This ORF BLAST page is specialized for running blastp or tblastn searches. Go ahead
and click the BLAST button, which will do a blastp search against the nr (non-redundant)
database, which is a database where each sequence is only represented once.
A new page will open, and to see the data click on “View Report”.
There is a reference at the top of the page, followed by the number of sequences in
the database. Each of these was compared to the protein sequence that you
submitted.
Scroll down and you will see a color-coded graphic of sequences similar to yours. Red
means a “hot” or very similar sequence. The length of the bars indicates the extent of the
similarity from amino terminus to carboxy terminus.
Scroll down further and you will see the individual sequences that are the best match.
Below that are the alignments. Click on ref|NP_997538.1| and you get the protein entry
from GenBank, with a few references on the gene encoded by your DNA sequence. What
is the gene and what does it do?
Scroll down and look at the first alignment. How many of the amino acids are identical
in the first alignment?
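BLAST reports the number of identities for you, but the count itself is simple. Here is a small Python sketch, assuming you have copied the two aligned lines of an alignment (with "-" for gaps) into strings. The function name and the two short sequences are invented for illustration and are not taken from the BLAST output.

```python
def count_identities(aligned_query, aligned_subject):
    """Return (identical positions, alignment length) for two aligned strings."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"          # a gap never counts as a match
    )
    return identical, len(aligned_query)


# Invented aligned pair with one mismatch (K/R) and one gap.
identical, length = count_identities("MKT-AY", "MRTLAY")
print(f"{identical}/{length} identical ({100 * identical / length:.0f}%)")
```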
There is a second alignment in which the computer is taking a repeat sequence and trying
to align it to a less-conserved sequence. You can ignore that but note that this can be a
problem with domains that have repeats or are repeated.
What species is this sequence from? Scroll down until you find three more species.
What are they (common name) and how are they related to mice?
Now go back up to the top of the page and click on Show Conserved Domains. This is a
graphical representation of the domains found in this protein. What are the domains?
What do you think each of these do? Click around a little and see what more you can
find out.
This is all you have to do. Other useful sites include the NCBI BLAST site:
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
where you can do all five kinds of BLAST searches, and the ExPASy site:
http://us.expasy.org/
if you want to look more at protein domains in sequences. There are also lots of nice
tools in the Structure button on the NCBI page.