Understanding
Basic Local Alignment Search Tool
Muhammad Awais
PhD Biochemistry
08-ARID-1103
DNA Sequencing
• Bioinformatics is based on the fact that DNA sequencing is cheap and is quickly becoming easier and cheaper.
• The Human Genome Project cost roughly $3 billion and took 13 years (1990-2003).
• Sequencing James Watson’s genome in 2007 cost $2 million and took 2 months.
• Today, you could get your genome sequenced for about $100,000, and it would take a month.
Extracting information from the genome is difficult: converting a string of A's, C's, G's, and T's into knowledge of how the organism works is hard.
Most of the work is done on the computer, with key confirming experiments done in the “wet lab”.
The sequence below contains a gene critical for life: the gene that initiates
replication of the DNA. Can you spot it?
We are now going to spend some time on what genes look like and how
we can find them.
TTGGAAAACATTCATGATTTATGGGATAGAGCTTTAGATCAAATTGAAAAAAAATTAAGCAAACCTAGTTTTGAAACCTG
GCTCAAATCGACAAAAGCTCATGCTTTACAAGGAGACACGCTCATTATTACTGCACCTAATGATTTTGCACGGGACTGGT
TAGAATCTAGGTATTCTAATTTAATTGCTGAAACACTTTATGATCTTACGGGGGAAGAGTTAGATGTAAAATTTATTATT
CCTCCTAACCAGGCCGAGGAAGAATTCGATATTCAAACTCCTAAAAAGAAAGTCAATAAAGACGAAGGAGCAGAATTTCC
TCAAAGCATGCTAAATTCGAAGTATACCTTTGATACATTTGTTATCGGATCTGGAAATCGGTTTGCGCATGCAGCTTCTT
TAGCAGTAGCAGAAGCGCCGGCTAAAGCGTATAATCCGCTTTTTATTTACGGGGGAGTAGGATTAGGCAAAACACACTTA
ATGCACGCCATAGGCCACTATGTGTTAGATCATAATCCTGCCGCGAAAGTCGTGTACTTATCATCTGAAAAATTCACAAA
CGAGTTTATTAACTCTATTCGTGACAATAAAGCAGTAGAATTCCGCAACAAATACCGTAATGTAGATGTTTTACTGATTG
ATGATATTCAATTCTTAGCAGGTAAAGAGCAGACACAAGAAGAATTTTTCCATACGTTTAATACGCTTCACGAAGAAAGC
AAGCAGATTGTCATCTCAAGTGATCGACCGCCGAAAGAAATTCCTACACTTGAAGATCGACTTCGCTCTCGCTTTGAATG
GGGCCTTATTACAGACATCACACCACCAGATTTGGAAACACGAATTGCTATTTTGCGTAAAAAAGCCAAAGCGGACGGCT
TAGTTATTCCAAATGAAGTTATGCTTTATATCGCCAATCAGATTGATTCAAATATTAGAGAATTAGAAGGCGCACTTATT
Genes and Proteins
• Most genes code for proteins: each gene contains
the information necessary to make one protein.
• Proteins are the most important type of
macromolecule.
– Structure: collagen in skin, keratin in hair, crystallin
in eye.
– Enzymes: all metabolic transformations, building
up, rearranging, and breaking down of organic
compounds, are done by enzymes, which are
proteins.
– Transport: oxygen in the blood is carried by
hemoglobin, everything that goes in or out of a cell
(except water and a few gasses) is carried by
proteins.
The Genetic Code
• Proteins are long chains of amino acids.
• There are 20 different amino acids coded in DNA.
• There are only 4 DNA bases, so you need 3 DNA bases to code for the 20 amino acids (2 bases would give only 4 x 4 = 16 combinations).
– 4 x 4 x 4 = 64 possible 3-base combinations (codons)
– Each codon codes for one amino acid (or for a stop signal)
– Most amino acids have more than one possible codon
• Genes start at a start codon and end at a stop codon.
• 3 codons are stop codons (TAA, TAG, and TGA): all genes end at a stop codon.
• Start codons are a bit trickier, since they are used in the middle of genes as well as at the beginning.
– In eukaryotes, ATG is almost always the start codon.
– In prokaryotes, ATG, GTG, or TTG can be used as a start codon.
In bioinformatics, we generally ignore the fact that RNA uses the base uracil (U) in place of T.
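The codon rules above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the names `CODON_TABLE` and `translate` are made up here, and the table is the standard genetic code written in the usual compact T, C, A, G order, with `*` marking a stop codon.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA (or RNA) string codon by codon, starting at the first base."""
    dna = dna.upper().replace("U", "T")  # treat RNA's U as T, as the slide suggests
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Unknown codons (for example, ones containing N) become "X".
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(len(CODON_TABLE))        # 64 codons in total
print(translate("ATGGCCTAA"))  # MA*
```

The table is applied literally, so an alternative start codon such as TTG is shown as the amino acid it normally encodes.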
Reading Frames
• Since codons consist of 3 bases, there are 3 “reading frames” possible on an RNA (or DNA), depending on whether you start reading from the first base, the second base, or the third base.
– The different reading frames give entirely different proteins.
• Each gene uses a single reading frame, so once the ribosome gets started, it just has to count off groups of 3 bases to produce the proper protein.
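As a small illustration of the three reading frames, the sketch below splits the first 12 bases of the example sequence into codons, starting from the first, second, and third base. The function name `codons_in_frame` is hypothetical and chosen only for this example.

```python
def codons_in_frame(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


seq = "TTGGAAAACATT"  # first 12 bases of the example sequence
for frame in range(3):
    print("frame", frame + 1, ":", " ".join(codons_in_frame(seq, frame)))
# frame 1 : TTG GAA AAC ATT
# frame 2 : TGG AAA ACA
# frame 3 : GGA AAA CAT
```

Leftover bases at the end that do not fill a whole codon are simply dropped.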
Open Reading Frames
• Ribosomes are very obedient to stop codons: when a stop codon is reached, the protein is finished. Thus, all genes end at the first stop codon in their reading frame.
• Since 3 out of the 64 codons are stop codons, random DNA has stop codons very frequently (on average, about one codon in 21).
– Open reading frames (ORFs) are regions with no stop codons. All genes reside in long open reading frames.
– Note that stop codons in other reading frames have no effect on the gene.
• The start codon must occur “upstream” in the same reading frame as the stop codon. It is usually near the beginning of the ORF, but it is not necessarily the first possible start codon.
– Determining the exact start codon is not easy or obvious.
– But the first start codon in an open reading frame is always a reasonable guess.
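The ORF idea can be sketched as a simple scan: in each of the three frames, remember the first start codon seen after the previous stop, and report the stretch when the next stop codon arrives. This is a minimal sketch under stated assumptions, not a real gene finder: it scans only the strand it is given, uses the prokaryotic start codons listed earlier, and the name `find_orfs` and the `min_codons` cut-off are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # prokaryotic start codons


def find_orfs(seq, min_codons=3):
    """Return (frame, start, end) for each ORF on the given strand.

    start is the index of the first start codon after the previous stop,
    end is the index just past the stop codon, and frame is 0, 1, or 2.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first possible start: a reasonable guess
            elif codon in STOP_CODONS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # the next ORF begins after this stop
    return orfs


toy = "CCATGAAACCCGGGTAACC"
for frame, start, end in find_orfs(toy):
    print(frame, start, end, toy[start:end])
# 2 2 17 ATGAAACCCGGGTAA
```

Real genes are hundreds of codons long, so in practice `min_codons` would be set much higher than 3.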
BLAST (Basic Local Alignment Search Tool)
• The BLAST programs are a set of sequence comparison algorithms, introduced in 1990, that are used to search sequence databases for optimal local alignments to a query.
• BLAST itself is a piece of software that can be run on almost any computer, but the database needed for a good cross-species comparison is quite large.
– The database is called “nr” for “non-redundant”, and it contains at least 20 Gb of sequence data.
• Terminology: your sequence, which you paste into the box on the web site, is the query sequence. Sequences in the database that match yours are called subject sequences.
Global Alignment
• Compares the total length of two sequences
Local Alignment
• Compares segments of sequences
• Finds cases when one sequence is a part of another sequence, or they only match in parts.
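The difference between the two kinds of alignment shows up clearly in their scores. The sketch below uses the classic dynamic-programming methods (Needleman-Wunsch for global, Smith-Waterman for local) rather than BLAST's own faster heuristic, and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions picked for illustration.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score: the best-matching pair of segments wins."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0: a bad region just restarts the alignment.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


a, b = "GGGACGTTT", "CCACGTAA"  # the two share only the segment ACGT
print(local_alignment_score(a, b))   # 8 (four matches in ACGT)
print(global_alignment_score(a, b))  # lower, because the unmatched ends are penalized
```

For two identical sequences the two scores agree; they diverge when only part of one sequence resembles the other, which is the case a local alignment is designed to find.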
BLAST terminology
[Diagram: a query sequence is searched with BLAST against a target database (GenBank/SwissProt); the output is a list of hit (subject) sequences, which gives information about the input query sequence, e.g., its function.]
The aim of a database (BLAST) search is to discover sequence homology on the basis of sequence similarity.
BLAST returns similar sequences, which are not necessarily biologically similar sequences.

blastx
• Searches a protein database using a translated nucleotide query
• Use it to find proteins homologous to a nucleotide coding region
• Translates the query sequence in all six reading frames
• Often the first analysis performed with a newly determined nucleotide sequence

tblastn
• Searches a translated nucleotide database using a protein query
• Does six-frame translations of the nucleotide database
• Finds homologous protein coding regions

tblastx
• Searches a translated nucleotide database using a translated nucleotide query
• Both translations use all six frames
• Useful in identifying potential proteins
• Good tool for identifying novel genes
• Computationally intensive
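The "six reading frames" used by the translated searches are the three frames of the sequence as given plus the three frames of its reverse complement. The sketch below shows one way to produce them; it is an illustration only, not how BLAST is implemented, and the names `reverse_complement`, `translate`, and `six_frame_translations` are hypothetical.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated protein strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCC").items():
    print(label, protein)
# +1 MA
# +2 W
# +3 G
# -1 GH
# -2 A
# -3 P
```

Translating every database sequence six ways is part of why the translated searches are slower than a plain protein search.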
BLAST Scores
• Results are arranged with the best ones on top.
• The most important score is the Expect value, or E-value, which can be defined as the number of hits scoring at least this well that a random sequence (with the same length as yours) would be expected to have in the database.
– E-values for good hits are usually written something like 3e-42, which is the same as 3 x 10^-42, a very small number.
– Bad hits are very common, and they have E-values in a more familiar form: for example, 0.004 or 1.2.
• In this case we see many hits with good E-values, and the top E-values are all quite similar.
• Before we can conclude that our protein is a homologue of the proteins BLAST matches it with, we would like them to have roughly the same length and a high percentage of identical amino acids.
– The lengths of the query and subject sequences should be within 20% of each other.
– There should be at least 30% identical amino acids.
– In this case we can be quite sure we have a good match.
• BLAST also returns a fourth value, the bit score, which we are going to ignore.
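The rules of thumb above can be written down as a small check. This is a sketch under stated assumptions: the slides do not give an E-value cut-off, so `max_evalue=1e-5` is an arbitrary placeholder, "within 20%" is interpreted here as the length difference divided by the longer sequence, and the function name `looks_homologous` is hypothetical.

```python
def looks_homologous(query_len, subject_len, percent_identity, evalue,
                     max_evalue=1e-5):
    """Apply the slide's rules of thumb to one BLAST hit.

    max_evalue is an assumed cut-off, not a value taken from the slides.
    """
    # Length difference as a fraction of the longer sequence (an interpretation).
    length_gap = abs(query_len - subject_len) / max(query_len, subject_len)
    return (
        evalue <= max_evalue
        and length_gap <= 0.20
        and percent_identity >= 30.0
    )


# E-values written like 3e-42 are ordinary floating-point notation.
good_evalue = float("3e-42")
print(looks_homologous(174, 180, 45.0, good_evalue))  # True
print(looks_homologous(174, 400, 45.0, good_evalue))  # False: lengths too different
print(looks_homologous(174, 180, 45.0, 1.2))          # False: E-value too large
```

A hit that passes all three checks is a candidate homologue, not a proof of shared function.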
A Sequence to BLAST
• This is a more-or-less randomly chosen gene from Bacillus megaterium.
– It is 174 amino acids long.
• It is written in “fasta” format: the first line starts with > and is immediately followed by an identifier (ORF00135), and then some miscellaneous comments.
• After that, the sequence is written without spaces or other marks.
>ORF00135 |chromosome 538197-538721
MKAKLIQYVYDAECRLFKS
VNQHFDRKHLNRFLRLLTH
AGGATFTIVIACLLLFLYPSS
VAYACAFSLAVSHIPVAIAK
KLYPRKRPYIQLKHTKVLE
NPLKDHSFPSGHTTAIFSLVT
PLMIVYPAFAAVLLPLAVMV
GISRIYLGLHYPTDVMVGLI
LGIFSGAVALNIFLT
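Reading a record like the one above takes only a few lines. The parser below is a minimal sketch (the name `parse_fasta` is made up for this example): the text after > up to the first space is the identifier, the rest of that line is the comment, and all following lines are joined into one sequence.

```python
def parse_fasta(text):
    """Return a list of (identifier, comment, sequence) tuples from FASTA text."""
    records = []
    identifier, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                records.append((identifier, comment, "".join(chunks)))
            header = line[1:].split(None, 1)  # identifier, then free-form comment
            identifier = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            chunks.append(line)  # sequence lines are simply concatenated
    if identifier is not None:
        records.append((identifier, comment, "".join(chunks)))
    return records


EXAMPLE = """>ORF00135 |chromosome 538197-538721
MKAKLIQYVYDAECRLFKS
VNQHFDRKHLNRFLRLLTH
AGGATFTIVIACLLLFLYPSS
VAYACAFSLAVSHIPVAIAK
KLYPRKRPYIQLKHTKVLE
NPLKDHSFPSGHTTAIFSLVT
PLMIVYPAFAAVLLPLAVMV
GISRIYLGLHYPTDVMVGLI
LGIFSGAVALNIFLT
"""

identifier, comment, sequence = parse_fasta(EXAMPLE)[0]
print(identifier, len(sequence))  # ORF00135 174
```

The length check is a quick way to confirm that a sequence was pasted completely before submitting it to BLAST.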
Protein Blast
Tblastn (Protein to Nucleotide)
Gene Names
• Most genes are named for the function of their protein.
– At some point, some related genes had their function determined through lab work: by examining the effects of mutations in the gene, by isolating and studying the protein produced by the gene, etc.
– Functions include enzymes (names end in –ase), transport across the cell membrane, genetic information processing (DNA->RNA->protein), structural proteins, sporulation and germination, and more!
• Many genes (maybe 1/4 of them in a typical genome) have no known function, although they are found in several different species: these are conserved hypothetical genes.
• Every new genome has some genes that are unique: no matching BLAST hits in the database.
– Are they real genes? Sometimes there is evidence in the form of messenger RNA, but usually we don’t know.
– We call them hypothetical genes.
• “Putative” means that we think we know the gene’s function but we aren’t sure. “Putative” should be followed by the function name.
Summary
1. DNA can be read in 3 different reading frames, a consequence of the genetic code (3 bases = 1 amino acid).
2. Genes are found in long open reading frames, areas where there are no stop codons.
3. BLAST is the tool we use to compare sequences between species.
• BLAST scores (E-values) estimate how many hits this good a random sequence would be expected to have in the database.
4. Gene sequences are conserved between species by natural selection.
• DNA sequences outside of genes are much less conserved.
Internet Links
blast.ncbi.nlm.nih.gov/ (Official)
www.ncbi.nlm.nih.gov/ (Official)
www.ebi.ac.uk/Tools/sss/ncbiblast/ (EU Database)
www.youtube.com/watch?v=rlK-5joOlyU (Video Tutorials)
http://en.wikipedia.org/wiki/BLAST (Text Knowledge)
http://www.dogpile.com/ (Preferred Search Engine)