Where to find protein sequences ?

Protein databases
Most of these databases can be accessed by :
- Sequence identifier
- Keywords
- BLAST
3D structure visualization
Protein workshop
Swiss PDBViewer
Translation
Genome information
GenBank (Entrez nucleotide)
Species-specific databases
ex2
?
Protein sequence
GenBank (Entrez protein)
UniProtKB (SwissProt)
Protein structure
Protein Data Bank (PDB)
Homology modeling
Swiss Model
BLAST
ex1
The GenBank 179.0 Release (Aug 16th 2010) requires roughly 451 GB (uncompressed sequence files only).
Similar protein sequences/Domain analysis
Protein Families (Pfam)
CLUSTAL-W
Evolution trees
BLAST : basic local alignment search tool
Smith-Waterman algorithm
Substitution matrix
BLOSUM = Block Substitution Matrix
PAM : Point/Percent accepted mutation
Gap insertion penalty
Gap extension penalty
Find all segment pairs whose scores cannot be improved by extension or trimming
query : CGNLSTCMLGTYTQDFNKF-----HTFPQTAIGVGAP
match : KCNTATCATQRLANFLVHSSNNFGAILSSTNVGSNTY
(the match is also called Sbjct or hit; between the two rows, | marks identical residues, : and . mark similar residues)
High-scoring Segment Pairs (HSP)
scores
E-value
P-value
protein sequence database
cutoff
Multiple alignment : ClustalW
Altschul SF et al. Basic Local Alignment Search Tool. J Mol Biol. 1990; 215: 403–410.
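
A minimal sketch of the Smith-Waterman recursion listed above. It assumes a single linear gap penalty instead of separate gap insertion and extension penalties, and a plain match/mismatch score instead of a BLOSUM or PAM matrix. The function name and the default values are illustrative, not taken from the slides.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Simplifications (assumptions): identity-based scoring and a linear
    gap penalty; BLAST itself uses a substitution matrix and affine gaps.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

The floor at 0 is what makes the alignment local: a segment whose running score drops below zero is abandoned, so the reported score belongs to a pair that cannot be improved by extending or trimming it.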
Alignment score matrices
Example of BLOSUM 62 : set of ‘trusted’ aligned protein sequences → select pairs
of sequences with less than 62% identity → calculate probability frequency p_{a,b}

$$ s_{a,b} = \frac{1}{\lambda} \log \frac{p_{a,b}}{f_a f_b} $$

where f_x is the occurrence probability of amino acid x

# Entries for the BLOSUM62 matrix at a scale of ln(2)/2.0.
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1 -1 -1 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  4 -3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4 -3  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0 -2  4 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1 -3  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0 -3  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3  3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4  3 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0 -3  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3  2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3  0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0 -2  0 -1 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -1 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3  2 -2 -1 -4
B -2 -1  4  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4 -3  0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3  3  3 -3  2  0 -3 -2 -1 -2 -1  2 -3  3 -3 -1 -4
Z -1  0  0  1 -3  4  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -2 -2 -2  0 -3  4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Sean R Eddy 2004, Nature Biotechnology 22 :1035-6
BLOSUM80 : more conserved sequences
BLOSUM40 : more divergent sequences
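
As a hedged illustration of the log-odds formula on this slide, the sketch below turns a pair frequency and two background frequencies into a rounded integer score at the ln(2)/2 scale quoted in the matrix header. The function name and the numeric frequencies are assumptions chosen for illustration. They are not taken from the slides.

```python
import math

# Scale quoted in the matrix header: lambda = ln(2)/2, i.e. half-bit units.
LAMBDA = math.log(2) / 2.0


def log_odds_score(p_ab, f_a, f_b, lam=LAMBDA):
    """Rounded score s_ab = (1/lambda) * log(p_ab / (f_a * f_b))."""
    return round(math.log(p_ab / (f_a * f_b)) / lam)


# Illustrative (assumed) frequencies for a frequently conserved residue pair:
# the pair is seen about 3.8 times more often than expected by chance.
p_pair = 0.0371
f_residue = 0.099
score = log_odds_score(p_pair, f_residue, f_residue)

if __name__ == "__main__":
    print(score)
```

A pair observed exactly as often as chance predicts (p_ab = f_a f_b) scores 0. Pairs seen more often than chance score positive and rarer pairs score negative.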
Evaluation of the similarity : E- and P-value
m : query size
n : database size
S : score
E-value : the expected number of HSPs with score at least S is
$E = K\,m\,n\,e^{-\lambda S}$
where K and λ depend on the database statistics (amino acid frequencies) and on the
scoring system. K and λ are estimated from the score distribution.
P-value : the probability that the score S from the comparison of two unrelated
sequences is at least x is
$P(S \ge x) = 1 - e^{-E(x)}$
For small E-values, P ≈ E
[Figure: example of score distribution fitted with the E-distribution; P-score distribution of the same data]
Bit-scores : normalized scores. $E = m\,n\,2^{-S'}$ with
$S' = \frac{\lambda S - \ln K}{\ln 2}$
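
The three formulas of this slide can be checked against each other numerically. The sketch below is an illustration only: the function names are hypothetical, and the values of K, λ, m and n are assumed placeholders, since in practice K and λ are estimated for a given scoring system and database.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of HSPs with score >= S: E = K m n exp(-lambda S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """P(S >= x) = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-e)


def bit_score(score, K, lam):
    """Normalized score S' = (lambda S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


# Assumed placeholder parameters (not from the slides).
K, LAM = 0.041, 0.267
m, n = 32, 5_000_000      # query length, database size in residues
S = 100                   # raw alignment score

E = e_value(S, m, n, K, LAM)
P = p_value(E)
S_bits = bit_score(S, K, LAM)

if __name__ == "__main__":
    print(E, P, S_bits)
```

With the bit score, K and λ drop out of the E-value, which is why bit scores from searches using different scoring systems can be compared directly.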
Practical BLAST
The different BLAST programs :
Program   Query                   Database
BLASTN    nucleotide              nucleotide
BLASTP    protein                 protein
BLASTX    translated nucleotide   protein
TBLASTN   protein                 translated nucleotide
TBLASTX   translated nucleotide   translated nucleotide
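
The program table can also be read as a lookup: given the kind of sequence you have and the kind of database you want to search, which programs apply? The sketch below encodes the table as a dictionary. The names BLAST_PROGRAMS and programs_for are hypothetical helpers for illustration, not part of any BLAST distribution.

```python
# (query, database) as compared by each program; "translated nucleotide"
# means a nucleotide sequence that is translated before comparison.
BLAST_PROGRAMS = {
    "BLASTN": ("nucleotide", "nucleotide"),
    "BLASTP": ("protein", "protein"),
    "BLASTX": ("translated nucleotide", "protein"),
    "TBLASTN": ("protein", "translated nucleotide"),
    "TBLASTX": ("translated nucleotide", "translated nucleotide"),
}


def _stored_as(kind):
    """Molecule type actually supplied by the user or held in the database."""
    return kind.replace("translated ", "")


def programs_for(query_molecule, database_molecule):
    """List programs usable for a 'nucleotide' or 'protein' query/database."""
    return sorted(
        name
        for name, (query, database) in BLAST_PROGRAMS.items()
        if _stored_as(query) == query_molecule
        and _stored_as(database) == database_molecule
    )


if __name__ == "__main__":
    print(programs_for("protein", "nucleotide"))
```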
Databases :
Species-specific genomes (not curated) : choose one or more species or group at
http://www.ncbi.nlm.nih.gov/mapview/
Protein database (curated) : http://www.uniprot.org/
Parameters :
Cutoff E ≤ 0.01 : conservative search
Cutoff E ≤ 1 : weak homologies
Gap penalties : gap-open, gap-extend ...
Leave them as they are, to start with !
Filter repetitive sequences : Yes !
PSI-BLAST : an iterative BLAST
program, to find distantly related
proteins
More information
GenBank, Pubmed, Entrez
The NCBI handbook
http://www.ncbi.nlm.nih.gov/
More on bioinformatics
Bioinformatics for Human Biologists - course programme, winter 2009
http://www.cbs.dtu.dk/courses/humanbio/2009/programme.php
Expasy
UniProtKB protein database
Protein analysis tools, Swiss-PDB Viewer, Swiss-Model
http://expasy.org/
Protein DataBank
Protein 3D structure, Protein workshop
http://www.pdb.org/pdb/home/home.do
Protein families (Pfam)
http://pfam.sanger.ac.uk/
Example : calcitonin sequence
Expasy → UniProtKB → ‘human calcitonin’
→ P01258 (CALC_HUMAN)
Retrieve calcitonin peptide sequence in FASTA format :
>P01258|85-116
CGNLSTCMLGTYTQDFNKFHTFPQTAIGVGAP
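
A small sketch of how such a FASTA record could be read before submitting it to BLAST. The parser is a minimal hand-written helper (parse_fasta is a hypothetical name), and splitting the header on "|" and "-" assumes the exact header layout shown above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


fasta = """>P01258|85-116
CGNLSTCMLGTYTQDFNKFHTFPQTAIGVGAP
"""

header, sequence = parse_fasta(fasta)[0]
accession, residue_range = header.split("|")
start, end = (int(x) for x in residue_range.split("-"))

if __name__ == "__main__":
    # The residue range in the header should match the peptide length.
    print(accession, len(sequence), end - start + 1)
```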
Graphical overview of BLAST results
The query sequence is represented by the numbered red bar at the top of the figure. Database hits are shown
aligned to the query, below the red bar. Of the aligned sequences, the most similar are shown closest to the query.
In this case, there are three high-scoring database matches that align to most of the query sequence. The next
twelve bars represent lower-scoring matches that align to two regions of the query, from about residues 3–60 and
residues 220–500. The cross-hatched parts of these bars indicate that the two regions of similarity are on the
same protein, but that this intervening region does not match. The remaining bars show lower-scoring alignments.
Mousing over a bar displays the definition line for that sequence in the window above the graphic.
The NCBI handbook, The BLAST Sequence Analysis Tool, Tom Madden
The UniProtKB database
A curated database : SwissProt
A Bairoch et al.
Organism distribution
An automated database : TrEMBL
Sequence length distribution
H sapiens : 0.6%
Release 2010_09 of 10-Aug-2010