(PSI-)BLAST & MSA via Max-Planck

General Issues
• Where? (where to look for homologues)
  • Structural templates: search against the PDB
  • Sequence homologues: search against Swiss-Prot or UniProt (recommended!)
• How many?
  • As many as possible, as long as the MSA looks good (next week…)
• How long? (length of homologues)
  • Fragments: short homologues (less than 50-60% of the query's length) give a bad alignment
  • Ensure your sequences contain the wanted domain(s)
  • N/C termini tend to vary in length between homologues
• How close? (distance from the query sequence)
  • All too close: no information
  • Too many too far: bad alignment
  • Ensure that you have a balanced collection!
• From whom? (which species the sequence belongs to)
  • Don't care, all homologues are welcome
  • Orthologues/paralogues may be helpful
  • Sequences from distant/close species provide different types of information
• Which method? (BLAST/PSI-BLAST)
  • Depends on the protein, the available homologues, the goal in mind…

Rules For Choosing Sequences
• Very similar sequences carry little information
• Very different sequences cause trouble: those <30% identical with more than half of the other sequences in the set
• Choose sequences as distantly related as possible: sequences 30-80% identical with more than half of the sequences in the set
• The more sequences the better

Overall work steps
1. Run the search
   1. Select database
   2. E-value threshold
   3. BLAST or PSI-BLAST: how many rounds?
2. Take out sequences: HSP (slider region) or full sequences
3. Align sequences: choose an alignment program
4. View the alignment with BioEdit or another program
5. Calculate trees, conservation scores (ConSurf) etc…

(PSI-)BLAST via Max-Planck
http://toolkit.tuebingen.mpg.de/sections/search
• Databases: Swiss-Prot, TrEMBL, NR, env, pdb or any combination for proteins, but only NT for DNA.
• All BLAST programs
Main advantage: you can easily extract and filter the HSPs, in addition to full sequences

The Query Protein
Name: Dihydrodipicolinate reductase
Enzyme reaction: [figure]
Molecular process: Lysine biosynthesis (early stages)
Organism: E. coli
Sequence length: 273 aa
Query: DAPB_ECOLI
>DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAGKTGVTVQSSLDAV
KDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQAIRDAAADIAIVFAANFSVGVNVMLKLL
EKAAKVMGDYTDIEIIEAHHRHKVDAPSGTALAMGEAIAHALDKDLKDCAVYSREGHTGERVPGTIGFATV
RAGDIVGEHTAMFADIGERLEITHKASSRMTFANGAVRSALWLSGKESGLFDMRDVLDLNNL

(PSI-)BLAST via Max-Planck
http://toolkit.tuebingen.mpg.de/psi_blast/
• Upload a sequence or an MSA
• Choose a database or databases (select several using CTRL)
• The E-value threshold can be assessed using the distribution

Forward results to MSA
http://toolkit.tuebingen.mpg.de/sections/alignment
• All marked hits, or filter by E-value
• HSP (slider region) or full sequences

Align via Max-Planck
• Alignment results: save the alignment

Alignment viewing & editing
BioEdit
• http://www.mbio.ncsu.edu/BioEdit/BioEdit.html
• Easy-to-use sequence alignment editor
• View and manipulate alignments of up to 20,000 sequences.
• Four modes of manual alignment: select and slide, dynamic grab and drag, gap insert and delete by mouse click, and on-screen typing which behaves like a text editor.
• Reads and writes GenBank, FASTA, PHYLIP 3.2, PHYLIP 4, and NBRF/PIR formats.
  Also reads GCG and Clustal formats.

Alignment viewing & editing
Easiest using BioEdit
http://www.mbio.ncsu.edu/BioEdit/bioedit.html
• Find a specific sequence: “Edit -> search -> in titles”
• Remove/add sequences: “Edit -> cut/paste/delete sequence”
• “Sequence Identity matrix” under “Alignment” is useful for a rough evaluation of distances within the alignment.
• After taking out sequences, “Minimize Alignment” under “Alignment” removes unnecessary gaps.
• Save an image using “File -> Graphic View” and then “Edit -> Copy page as BITMAP”

A little about ConSurf
Compute Conservation Scores
• Give it an MSA, or it will compute one for you (given a FASTA sequence, it runs BLAST & MSA)
  Main advantage: filters short HSPs, removes redundant sequences
• Shows conservation scores on the sequence or on a protein structure (if available)

ConSurf
http://consurf.tau.ac.il/
http://consurf.tau.ac.il/results/1321532763/output.php
• Sequence conservation
• MSA colored by conservation
• PSI-BLAST result
• Sequences used
• MSA
• Phylogenetic tree
• Jmol: easy web-based viewer

WebLogo
http://weblogo.berkeley.edu/logo.cgi

No “Miracle solution”
Each sequence is a different story; adjust parameters:
• BLAST: E-value, substitution matrix, gap penalties, database, minimum length, redundancy level, fragment overlap…
• PSI-BLAST: BLAST parameters + PSSM inclusion threshold (or choose manually), number of rounds…
• Try using HSPs or full sequences, different MSA programs…
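
Sketch: applying the rules for choosing sequences
The length and identity rules given earlier (drop fragments shorter than about 50-60% of the query, prefer sequences that are 30-80% identical with more than half of the set) can be tried out on a small alignment before editing by hand. The Python sketch below is only an illustration under stated assumptions: the function names, the toy alignment, and the 0.6 length cutoff are hypothetical, and percent identity is computed here over columns where both sequences have a residue, which may differ from how BioEdit or ConSurf define it.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: this is one of several common definitions of identity.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def ungapped_length(seq: str) -> int:
    """Number of residues in an aligned sequence, ignoring gap characters."""
    return len(seq) - seq.count("-")


def select_sequences(msa, query_id, min_len_fraction=0.6, low=30.0, high=80.0):
    """Return the names of sequences that pass the length and identity rules.

    msa: dict mapping sequence name -> aligned sequence (equal lengths).
    The query itself is always kept. This is a single pass, not an
    iterative pruning of the set.
    """
    query_len = ungapped_length(msa[query_id])

    # Rule 1: drop fragments that are too short relative to the query.
    long_enough = {
        name: seq
        for name, seq in msa.items()
        if ungapped_length(seq) >= min_len_fraction * query_len
    }

    # Rule 2: keep sequences that are low-high % identical with more than
    # half of the other remaining sequences.
    kept = []
    for name, seq in long_enough.items():
        if name == query_id:
            kept.append(name)
            continue
        others = [s for n, s in long_enough.items() if n != name]
        in_window = sum(
            1 for s in others if low <= percent_identity(seq, s) <= high
        )
        if in_window > len(others) / 2:
            kept.append(name)
    return kept


# Hypothetical toy alignment (not real homologues of DAPB_ECOLI).
toy_msa = {
    "query": "ACDEFGHIKL",
    "a":     "ACDEFGHWWW",
    "b":     "ACDEFWWIKL",
    "dup":   "ACDEFGHIKL",  # identical to the query: too close
    "frag":  "ACD-------",  # fragment: too short
    "far":   "WWWWWWWWKL",  # too distant from most of the set
}

print(select_sequences(toy_msa, "query"))
```

On this toy set the filter keeps the query, "a" and "b". On a real alignment the cutoffs would need the same case-by-case adjustment as the other parameters above, and the result should still be checked by eye.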