Introduction to NCBI & Ensembl tools including BLAST and database searching

advertisement
Introduction to NCBI &
Ensembl tools including
BLAST and database
searching
Incorporating Bioinformatics into the
High School Biology Curriculum
Fran Lewitter, Ph.D.
Director, Bioinformatics & Research Computing
Whitehead Institute for Biomedical Research
Lewitter AT wi.mit.edu
http://jura.wi.mit.edu/bio
What I hope you’ll learn
• What do we learn from database
searching and sequence
alignments
• What tools are available at NCBI
• What tools are available through
Ensembl
Lewitter-Whitehead Institute – August 15, 2012
2
First some background info
Lewitter-Whitehead Institute – August 15, 2012
3
Lewitter-Whitehead Institute – August 15, 2012
4
Simian sarcoma virus onc gene, v-sis, is derived
from the gene (or genes) encoding a platelet-derived
growth factor.
Doolittle RF, Hunkapiller MW, Hood LE, Devare
SG, Robbins KC, Aaronson SA, Antoniades
HN. Science 221:275-277, 1983.
Lewitter-Whitehead Institute – August 15, 2012
5
Why do alignments
• Use sequence similarity to infer homology and/or
structural similarity between 2 or more
genes/proteins
• Identify more conserved regions of a protein,
potentially identifying regions of most functional
importance
• Compare and contrast homologs (perhaps into
groups) based on shared positions or regions
• Infer evolutionary distance from sequence
dissimilarity
Lewitter-Whitehead Institute – August 15, 2012
6
Evolutionary Basis of
Sequence Alignment
• Similarity - observable quantity, such as
percent identity
• Homology - conclusion drawn from data
that two genes share a common
evolutionary history; no metric is associated
with this
– Paralog – genes related by duplication
– Ortholog – genes related by speciation
Lewitter-Whitehead Institute – August 15, 2012
7
Local vs Global Alignment
LOCAL
GLOBAL
From Mount, Bioinformatics, 2004, pg 71
Lewitter-Whitehead Institute – August 15, 2012
8
Nucleotide vs Protein
• If comparing protein coding genes, use
protein sequences because of less noise
• If protein sequences are very similar, it
might be more instructive to use DNA
sequences
Lewitter-Whitehead Institute – August 15, 2012
9
Example of simple scoring
system for nucleic acids
•
•
•
•
Match = +1 (ex. A-A, T-T, C-C, G-G)
Mismatch = -1 (ex. A-T, A-C, etc)
Gap opening = - 5
Gap extension = -2
T
C
A
G
A
C
G
A
G
T
G
T C G G A - - G C T G
+1 +1 -1 +1 +1 -5 -2 -1 -1 +1 +1 = -4
Lewitter-Whitehead Institute – August 15, 2012
10
Possible Alignments
A: T C A G A C G A G T G
B: T C G G A G C T G
I.
T C A G A C G A G T G
T C G G A - - G C T G
II.
T C A G A C G A G T G
T C G G A - G C - T G
score=-4
score=-5
III. T C A G A C G A G T G
T C G G A - G - C T G
score=-5
WIBR Sequence Analysis Course, © Whitehead Institute, February 2005
11
Scoring for Protein
Alignments - Amino Acid
Substitution Matrices
• PAM - point accepted mutation based on
global alignment [evolutionary model]
• BLOSUM - block substitutions based on local
alignments [similarity among conserved
sequences]
Lewitter-Whitehead Institute – August 15, 2012
12
AA Scoring Matrices
•PAM - point accepted mutation
based on global alignment
[evolutionary model]
Log-odds= pair in homologous proteins
pair in unrelated proteins by chance
•BLOSUM - block substitutions
based on local alignments [similarity
among conserved sequences]
Log-odds= obs freq of aa substitutions
freq expected by chance
Lewitter-Whitehead Institute – August 15, 2012
13
Substitution Matrices
BLOSUM 30 Increasing PAM 250 (80)
BLOSUM 62
BLOSUM 80
% identity
similarity
PAM 120 (66)
PAM 90
(50)
% change
Lewitter-Whitehead Institute – August 15, 2012
14
Scoring for BLAST Alignments
Score = 94.0 bits (230), Expect = 6e-19
Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%)
Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256
Y+ FC
+ + CY G G +YRG T SGA C PW S
V Q A+
Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257
Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297
GLG H +CRNPD D +PWC VL RL+WEYCD+ C T
Sbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298
Based on
BLOSUM62
Position
Position
Position
Position
.
Position
Position
.
1:
2:
3:
4:
. .
9:
10:
. .
Y
T
G
P
-
Y
S
S
E
=
=
=
=
7
1
0
-1
- - P = -11
- - A = -1
Sum
Lewitter-Whitehead Institute – August 15, 2012
230
15
Statistical Significance
• Raw Scores - score of an alignment equal to the sum of
substitution and gap scores.
• Bit scores - scaled version of an alignment’s raw score that
accounts for the statistical properties of the scoring system
used.
• E-value - expected number of distinct alignments that
would achieve a given score by chance. Lower E-value =>
more significant.
Lewitter-Whitehead Institute – August 15, 2012
16
A formula
E = Kmn e-lS
This is the Expected number of high-scoring
segment pairs (HSPs) with score at least S
for sequences of length m and n.
This is the E value for the score S.
The parameters K and l can be thought of simply as natural scales for the
search space size and the scoring system respectively.
Lewitter-Whitehead Institute – August 15, 2012
17
What’s significant?
• High confidence - >40% identity for long
alignments (Rost, 1999 found that sequence alignments
unambiguously distinguish between protein pairs of similar
and non-similar structure when the pairwise sequence identity
>40%)
• “Twilight zone” – blurry - 20-35% identity
• “Midnight zone” - <20% identity
Lewitter-Whitehead Institute – August 15, 2012
18
Tools of Interest
• NCBI (US) - http://www.ncbi.nlm.nih.gov/
– BLAST
– Entrez
• EBI/EMBL/WTSI (Europe)
– Ensembl (http://www.ensembl.org/)
Lewitter-Whitehead Institute – August 15, 2012
19
NCBI Home Page
Lewitter-Whitehead Institute – August 15, 2012
20
Ensembl Home Page
Lewitter-Whitehead Institute – August 15, 2012
21
Hands-On
• Entrez – finding sequences
• BLAST – look for similar sequences
• Ensembl – finding sequences
Lewitter-Whitehead Institute – August 15, 2012
22
How is BLAST used @ WIBR?
•
•
•
•
Looking for homologs
Identifying domains in proteins
Peptide identification
Genome annotation
Lewitter-Whitehead Institute – August 15, 2012
23
Remember
1. Don’t be afraid to look at help pages
for each website
• Click, click, click
2. Any analysis tool will give you
results
3. Interpret results you find
Lewitter-Whitehead Institute – August 15, 2012
24
Sequence
of 1992
Note
Biotechniques,
1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC
31 CCCCCTGACG AGCATCACAA AAATCGACGC
61 GGTGGCGAAA CCCGACAGGA CTATAAAGAT
…………..
1371 GTAAAGTCTG GAAACGCGGA AGTCAGCGCC
(Published in 1990)
“Here you see the actual structure of a small
fragment of dinosaur DNA,” Wu said. “Notice the
sequence is made up of four compounds - adenine,
guanine, thymine and cytosine. This amount of
DNA probably contains instructions to make a
single protein - say, a hormone or an enzyme. The
full DNA molecule contains three billion of these
bases. If we looked at a screen like this once a
second, for eight hours a day, it’d still take more
than two years to look at the entire DNA strand.
It’s that big.” (page 103)
Lewitter-Whitehead Institute – August 15, 2012
25
DinoDNA "Dinosaur DNA" from
Crichton's THE LOST WORLD p. 135
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGC
ATCAGATACAGTTGGAGATAAGGACGACGTGTGG
CAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA
CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCT
GGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTC
CCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGGG
GGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGC
CTCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTG
CCGTGGCAGACACGGGTACTTTGGGGACCCCCCA
GTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCC
CACTACCTGGAGCTGCTGCAACCCCCCCGGGGCA
GCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCC
ACTCAGCAGCGCCTGCGGCCTCTACTACAAAC……
Published in 1995
Lewitter-Whitehead Institute – August 15, 2012
26
>Erythroid transcription factor (NF-E1 DNA-binding protein)
Query: 121
Sbjct: 1
Query: 301
Sbjct: 61
Query: 481
Sbjct: 117
Query: 661
Sbjct: 170
Query: 841
Sbjct: 227
MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300
MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480
TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV
NCGAT
TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116
ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660
ATPLWRRDGTGHYLCN
ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS
NCQT
ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169
STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840
STTTLWRRSPMGDPVCN
ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226
GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020
GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF
GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286
Query: 1021 FGGGAGGYTAPPGLSPQI 1074
FGGGAGGYTAPPGLSPQI
Sbjct: 287 FGGGAGGYTAPPGLSPQI 304
Lewitter-Whitehead Institute – August 15, 2012
27
>Erythroid transcription factor (NF-E1 DNA-binding protein)
Query: 121
Sbjct: 1
Query: 301
Sbjct: 61
Query: 481
Sbjct: 117
Query: 661
Sbjct: 170
Query: 841
Sbjct: 227
MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300
MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480
TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV
NCGAT
TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116
ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660
ATPLWRRDGTGHYLCN
ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS
NCQT
ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169
STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840
STTTLWRRSPMGDPVCN
ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226
GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020
GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF
GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286
Query: 1021 FGGGAGGYTAPPGLSPQI 1074
FGGGAGGYTAPPGLSPQI
Sbjct: 287 FGGGAGGYTAPPGLSPQI 304
Lewitter-Whitehead Institute – August 15, 2012
28
Download