Protein databases Most of these databases can be accessed by : Sequence identifier Keywords BLAST 3D structure visualization Protein workshop Swiss PDBViewer Translation Genome information GenBank (Entrez nucleotide) Species-specific databases ex2 ? Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure Protein Data Bank (PDB) Homology modeling Swiss Model BLAST ex1 The GenBank 179.0 Release (Aug 16th 2010) requires roughly 451 GB (uncompressed sequence files only). Similar protein sequences/Domain analysis Protein Families (Pfam) CLUSTAL-W Evolution trees 1 BLAST : basic local alignment search tool Smith-Waterman algorithm Substitution matrix BLOSUM = Block Substitution Matrix PAM : Point/Percent accepted mutation Gap insertion penalty Gap extension penalty find all segment pairs whose scores can not be improved by extension or trimming query : CGNLSTCMLGTYTQDFNKF-----HTFPQTAIGVGAP | .||. :.: : : :..| :| : match : KCNTATCATQRLANFLVHSSNNFGAILSSTNVGSNTY or Sbjct or hit High-Scoring Element Pairs (HSP) scores E-value P-value protein sequence database cutoff Multiple alignment : ClustalW Altschul SF et al. Basic Local Alignment Search Tool. J Mol Biol. 1990; 215: 403–410. 2 Alignment score matrices Example of BLOSUM 62 : set of ‘trusted’ aligned protein sequences select pairs of sequences with less than 62% identity calculate probability frequency pa,b sa,b p log a,b f a f b 1 # Entries for the A R N D C A 4 -1 -2 -2 0 R -1 5 0 -2 -3 N -2 0 6 1 -3 D -2 -2 1 6 -3 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 E -1 0 0 2 -4 G 0 -2 0 -1 -3 H -2 0 1 -1 -3 I -1 -3 -3 -3 -1 L -1 -2 -3 -4 -1 K -1 2 0 -1 -3 M -1 -1 -2 -3 -1 F -2 -3 -3 -3 -2 P -1 -2 -2 -1 -3 S 1 -1 1 0 -1 T 0 -1 0 -1 -1 W -3 -3 -4 -4 -2 Y -2 -2 -2 -3 -2 V 0 -3 -3 -3 -1 B -2 -1 4 4 -3 J -1 -2 -3 -3 -1 Z -1 0 0 1 -3 X -1 -1 -1 -1 -1 * -4 -4 -4 -4 -4 BLOSUM62 Q E G -1 -1 0 1 0 -2 0 0 0 0 2 -1 -3 -4 -3 5 2 -2 2 5 -2 -2 -2 6 0 0 -2 -3 -3 -4 -2 -3 -4 1 1 -2 0 -2 -3 -3 -3 -3 -1 -1 -2 0 0 0 -1 -1 -2 -2 -3 -2 -1 -2 -3 -2 -2 -3 0 1 -1 -2 -3 -4 4 4 -2 -1 -1 -1 -4 -4 -4 where fx is the occurrence probability of amino acid x matrix H I -2 -1 0 -3 1 -3 -1 -3 -3 -1 0 -3 0 -3 -2 -4 8 -3 -3 4 -3 2 -1 -3 -2 1 -1 0 -2 -3 -1 -2 -2 -1 -2 -3 2 -1 -3 3 0 -3 -3 3 0 -3 -1 -1 -4 -4 at L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 3 -3 -1 -4 a scale of ln(2)/2.0. K M F P S T W Y V B -1 -1 -2 -1 1 0 -3 -2 0 2 -1 -3 -2 -1 -1 -3 -2 -3 0 -2 -3 -2 1 0 -4 -2 -3 -1 -3 -3 -1 0 -1 -4 -3 -3 -3 -1 -2 -3 -1 -1 -2 -2 -1 1 0 -3 -1 0 -1 -2 -1 -2 1 -2 -3 -1 0 -1 -3 -2 -2 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 -3 1 0 -3 -2 -1 -3 -1 3 -2 2 0 -3 -2 -1 -2 -1 1 5 -1 -3 -1 0 -1 -3 -2 -2 -1 5 0 -2 -1 -1 -1 -1 1 -3 0 6 -4 -2 -2 1 3 -1 -1 -2 -4 7 -1 -1 -4 -3 -2 0 -1 -2 -1 4 1 -3 -2 -2 -1 -1 -2 -1 1 5 -2 -2 0 -3 -1 1 -4 -3 -2 11 2 -3 -2 -1 3 -3 -2 -2 2 7 -1 -2 1 -1 -2 -2 0 -3 -1 4 0 -3 -3 -2 0 -1 -4 -3 -3 -3 2 0 -3 -2 -1 -2 -1 2 1 -1 -3 -1 0 -1 -2 -2 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4 -4 -4 -4 -4 -4 -4 -4 -4 Sean R Eddy 2004, Nature Biotechnology 22 :1035-6 J -2 -1 4 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 -3 0 -1 -4 Z -1 -2 -3 -3 -1 -2 -3 -4 -3 3 3 -3 2 0 -3 -2 -1 -2 -1 2 -3 3 -3 -1 -4 X -1 0 0 1 -3 4 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -2 -2 -2 0 -3 4 -1 -4 * -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1 BLOSUM80 : more conserved sequences BLOSUM40 : more divergent sequences 3 Evaluation of the similarity : E- and P-value m : query size n : database size S : score E-value : the expected number of HSPs with score at least S is E = K m n e-S where K and depends on the database statistics (amino acid frequencies) and on the scoring system. K and are estimated from the score distribution. P-value : the probability that the score S from the comparison of two unrelated sequences is at least x is Example of score distribution fitted with the E-distribution P- score distribution of the same data P(S ≥ x) = 1 - e-E(x) For small E-values, P ≈ E Bit-scores : normalized E-values. E = m n 2-S’ S' S lnK ln2 4 Practical BLAST The different BLAST programs : Program Database BLASTN nucleotide BLASTP protein BLASTX protein TBLASTN translated nucleotide TBLASTX translated nucleotide Query nucleotide protein translated nucleotide protein translated nucleotide Databases : Species-specific genomes (not curated) : choose one or more species or group at http://www.ncbi.nlm.nih.gov/mapview/ Protein database (curated) : http://www.uniprot.org/ Parameters : Cutoff E ≤ 0.01 : conservative search Cutoff E ≤ 1 : weak homologies Gap penalties : gap-open, gap-extend ... Let them as they are, to start with ! Filter repetitive sequences : Yes ! PSI-BLAST : an iterative BLAST program, to find distantly related proteins 5 More information GenBank, Pubmed, Entrez The NCBI handbook http://www.ncbi.nlm.nih.gov/ More on bioinformatics Bioinformatics for Human Biologists - course programme, winter 2009 http://www.cbs.dtu.dk/courses/humanbio/2009/programme.php Expasy UniProtKB protein database Protein analysis tools, Swiss-PDB Viewer, Swiss-Model http://expasy.org/ Protein DataBank Protein 3D structure, Protein workshop http://www.pdb.org/pdb/home/home.do Protein families (Pfam) http://pfam.sanger.ac.uk/ 6 Example : calcitonin sequence Expasy UniProtKB ‘human calcitonin’ P01258 (CALC_HUMAN) Retrieve calcitonin peptide sequence in FASTA format : >P01258|85-116 CGNLSTCMLGTYTQDFNKFHTFPQTAIGVGAP 7 Graphical overview of BLAST results The query sequence is represented by the numbered red bar at the top of the figure. Database hits are shown aligned to the query, below the red bar. Of the aligned sequences, the most similar are shown closest to the query. In this case, there are three high-scoring database matches that align to most of the query sequence. The next twelve bars represent lower-scoring matches that align to two regions of the query, from about residues 3–60 and residues 220–500. The cross-hatched parts of the these bars indicate that the two regions of similarity are on the same protein, but that this intervening region does not match. The remaining bars show lower-scoring alignments. Mousing over the bars displays the definition line for that sequence to be shown in the window above the graphic. The NCBI handbook, The BLAST Sequence Analysis Tool, Tom Madden 8 The UniProtKB database A curated database : SwissProt A Bairoch et al. Organism distribution An automated database : TrEMBL Sequence length distribution H sapiens : 0.6% Release 2010_09 of 10-Aug-2010 9 10