Protein function prediction
Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique, LF-2004.08

  Introduction
  Signal peptides
  Transmembrane regions and topology
  PTM (post-translational modifications)
  Low complexity and biased regions
  Repeats
  Coils
  Secondary structure
  Antigenic peptides
  Domain/Motifs
  Tools
  The EMBOSS package

Different techniques
  Algorithms: sliding window, nearest neighbor
  Patterns, regular expression
  Weight matrices
  HMM, profiles
  Neural Networks
  Rules

Sliding window
  THISISATESTSEQVENCETHATDISPLAYSTHESLIDINGWINDQW
  Score1, Score2, ... Scoren
  Width or Size=11, Step=5
  Results are usually displayed as a graph.

Patterns / regular expression
  Pattern: <A-x-[ST](2)-x(0,1)-{V}
  Regexp:  ^A.[ST]{2}.?[^V]
  Text:    The sequence must start with an alanine, followed by any amino acid, followed by a serine or a threonine twice, followed by any amino acid or nothing, followed by any amino acid except valine.
  Only the syntax differs.

Weight matrices (PSSM)

HMM / profiles

Neural Networks
  General principle
  Example

Signals found in proteins
  N-ter
    export / secretion
    mitochondria
    chloroplast
  internal
    NLS (nuclear localization signal)
  C-ter
    GPI-anchor (Glycosyl Phosphatidyl Inositol)
    other membrane anchors (see PTM)
  other
    unknown ?
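The sliding-window slide above (width 11, step 5) can be sketched in Python as follows. This is only an illustration: the slide does not say which per-residue score is used, so the Kyte-Doolittle hydropathy scale is an assumed choice here, and the names `KYTE_DOOLITTLE` and `sliding_window_scores` are made up for this example.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slide names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

WIDTH = 11  # window width, as on the slide
STEP = 5    # step between window starts, as on the slide


def sliding_window_scores(sequence, width=WIDTH, step=STEP, scale=KYTE_DOOLITTLE):
    """Return (start, mean score) for every full window along the sequence."""
    scores = []
    for start in range(0, len(sequence) - width + 1, step):
        window = sequence[start:start + width]
        mean = sum(scale[residue] for residue in window) / width
        scores.append((start, mean))
    return scores


SEQUENCE = "THISISATESTSEQVENCETHATDISPLAYSTHESLIDINGWINDQW"
window_scores = sliding_window_scores(SEQUENCE)

# One line per window: 1-based residue range and the window's mean score.
for start, score in window_scores:
    print(f"{start + 1:>3}-{start + WIDTH:<3} {score:+.2f}")
```

Plotting the second column against the window position gives the kind of graph the slide refers to. Trailing residues that do not fill a complete window are ignored in this sketch.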

Signal detection tools
  SignalP, MitoProt, ChloroP, Predotar, PSort, TargetP, Sigcleave (EMBOSS), Phobius, Big-PI, DGPI

Transmembrane regions
  Detection (signal peptide, hydropathy, helices)
  Organisation (topology)

Transmembrane detection tools
  TMHMM, TMPred, TopPred2, DAS, HMMTop, Tmap (EMBOSS)
  Mixture of tools: Phobius, ConPred II

Post-translational modifications
  Phosphorylation: S - T - Y
  Acetylation, methylation: D - E - K
  O-glycosylation: S - T - (HO)K
  N-glycosylation: N
  Sulfation: Y
  Farnesylation, myristylation, palmitoylation, geranylgeranylation, GPI-anchor: C - Nter - Cter
  Ubiquitination and family: K - Nter
  Inteins (protein splicing)
  Pre-translational
    Selenoprotein: C

PTM detection
  Pattern prediction (PROSITE)
    Short or weak signal
    Frequent hit producer
  Best method is experimental: MS/MS detection
  Most methods use "rules" that combine pattern detection and knowledge to predict sites.
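To make the link between PROSITE patterns and regular expressions concrete, here is a minimal sketch of a converter in Python. It handles only the basic syntax shown on the Patterns slide (`x`, `[..]`, `{..}`, repeat counts, `<` and `>` anchors) and is not the full PROSITE grammar; the function name `prosite_to_regex` and the short test peptide are invented for this example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    anchored_start = pattern.startswith("<")
    anchored_end = pattern.endswith(">")
    pattern = pattern.strip("<>")

    parts = []
    for element in pattern.split("-"):
        # Separate the residue part from an optional repeat such as (2) or (0,1).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residues, repeat = match.group(1), match.group(2)
        if residues == "x":
            token = "."                          # any amino acid
        elif residues.startswith("{"):
            token = "[^" + residues[1:-1] + "]"  # any amino acid except these
        else:
            token = residues                     # one residue or a [..] class
        if repeat:
            token += "{" + repeat + "}"
        parts.append(token)

    regex = "".join(parts)
    return ("^" if anchored_start else "") + regex + ("$" if anchored_end else "")


# The pattern from the Patterns slide.
slide_regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
print(slide_regex)

# The N-glycosylation pattern scanned over a made-up peptide.
nglyc_regex = prosite_to_regex("N-{P}-[ST]-{P}")
hits = [m.start() + 1 for m in re.finditer(nglyc_regex, "MNGTANPSQNLSA")]
print(nglyc_regex, hits)
```

The converter emits `.{0,1}` where the slide writes `.?`; the two are equivalent. A four-residue pattern such as the N-glycosylation one matches very often by chance, which is why such patterns are described as frequent hit producers.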
  NetOGlyc - Prediction of type O-glycosylation sites in mammalian proteins
  DictyOGlyc - Prediction of GlcNAc O-glycosylation sites in Dictyostelium
  YinOYang - O-beta-GlcNAc attachment sites in eukaryotic protein sequences
  NetPhos - Prediction of Ser, Thr and Tyr phosphorylation sites in eukaryotic proteins
  NMT - Prediction of N-terminal N-myristoylation
  Sulfinator - Prediction of tyrosine sulfation sites

Low complexity regions
  repeats
  compositional bias
  PEST

Low complexity / Repeats
  DUST (DNA) / SEG
  REPRO, Radar
  REP
  RepeatMasker (DNA)
  EMBOSS (DNA): einverted, equicktandem, etandem, palindrome
  EMBOSS (protein): oddcomp
  PEST, PESTFind
  Approaches used by these tools: search against a collection of known repeats, or de novo detection

Coils
  Helix of helix
  coiled-coil
  Leu-zipper

Coils detection
  COILS: weight matrices
  Paircoil, Multicoil: pairwise correlation
  Marcoil: HMM
  Pepcoil (EMBOSS): weight matrices

Secondary structure
  Structure to predict: alpha-helices, beta-sheets, turns, random coil
  Garnier (EMBOSS), PHD, DSC, PREDATOR, NNSSP, Jpred, Jnet, many others

Antigenic peptide
  Peptides binding to MHC
    class I: 8, 9, 10 mers
    class II: 15 mers (3+9+3)
    Depend highly on MHC type
  Use of experimental knowledge
    Databases of known peptides
    SYFPEITHI
    HLA_Bind (BIMAS)
    MAPPP (combined expert)
    Antigenic (EMBOSS)
    Many more
  Prediction of proteasome cleavage sites
    NetChop
    PaProc

Domain / Motif
  All the protein domain descriptors
    PROSITE
    PFAM
    SMART
    PRODOM
    BLOCKS
    PRINTS
    TIGRfam
    …
  Federation: InterPro
  Many techniques
    Patterns, Regexp
    PSSM (PSI-BLAST)
    Profiles
    HMM

Other Tools
  You can find some of them on our servers: www.ch.embnet.org
  Or on the ExPASy server: www.expasy.org/tools
  Or ask Google!! www.google.com

European Molecular Biology Open Software Suite (EMBOSS)

How to use EMBOSS/Jemboss at SIB

  Free Open Source (for most Unix platforms)
  GCG successor (compatible with GCG file format)
  More than 150 programs (ver. 2.9.0)
  Easy to install locally, but no interface; requires local databases
  Unix command-line only
  Interfaces
    Jemboss, www2gcg, w2h, wemboss… (with account)
    Pise, EMBOSS-GUI, SRSWWW (no account)
    Staden, Kaptain, CoLiMate, Jemboss (local)
  Access: www.emboss.org or emboss.sourceforge.net

Some details
  USA
    'asis' :: Sequence [start : end : reverse]
    Format :: '@' ListFile [start : end : reverse]
    Format :: 'list' : ListFile [start : end : reverse]
    Format :: Database : Entry [start : end : reverse]
    Format :: Database - SearchField : Word [start : end : reverse]
    Format :: File : Entry [start : end : reverse]
    Format :: File : SearchField : Word [start : end : reverse]
    Format :: Program Program-parameters '|' [start : end : reverse]
    Example: fasta::Swissprot:UBP5_HUMAN[200:300]
  Databases
    Any can be added; use showdb to display the available databases

databases
  showdb
  Displays information on the currently available databases

  # Name         Type ID  Qry All Comment
  # ====         ==== ==  === === =======
  ipr_fetch      P    OK  OK  OK  InterPro current by fetch
  ipi_fetch      P    OK  OK  OK  IPI current by fetch
  refseq_fetch   P    OK  OK  OK  refseq current by fetch
  repbase_fetch  P    OK  OK  OK  repbase current by fetch
  swiss_fetch    P    OK  OK  OK  SwissProt current by fetch
  swissprot      P    OK  OK  OK  SWISSPROT sequences
  trembl         P    OK  OK  OK  TREMBL sequences
  trembl_fetch   P    OK  OK  OK  trembl current by fetch
  tremblnew      P    OK  OK  OK  TREMBL New sequences
  ug_fetch       P    OK  OK  OK  Unigene by fetch
  embl           N    OK  OK  OK  EMBL release
  emhum          N    OK  OK  OK  EMBL release, Human section by emboss index
  emrod          N    OK  OK  OK  EMBL release, Rodent section by emboss index
  emvrt          N    OK  OK  OK  EMBL release, Vertebrate (nonhuman, nonrodent)

  seqret (seqretall, seqretset, seqretsplit)
  entret (for complete untouched entry, e.g., for unigene, interpro, swissprot…)
  Possible to define your own ".embossrc" file

Some tools for DNA
  redata    Search REBASE for enzyme name, references, suppliers etc
  remap     Display a sequence with restriction cut sites, translation etc
  restover  Finds restriction enzymes that produce a specific overhang
  restrict  Finds restriction enzyme cleavage sites
  showseq   Display a sequence with features, translation etc
  silent    Silent mutation restriction enzyme scan
  cirdna    Draws circular maps of DNA constructs
  lindna    Draws linear maps of DNA constructs
  revseq    Reverse and complement a sequence
  …

Example: remap
  ECLAC E.coli lactose operon with lacI, lacZ, lacY and lacA genes.
  [Cut-site labels above the sequence, column layout lost: Hin6I, TaqI, HhaI, Bsc4I, Bsu6I, Hin6I, BssKI, HhaI, AciI, BsiSI]
  GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGT
          10        20        30        40        50        60
  ----:----|----:----|----:----|----:----|----:----|----:----|
  CTGTGGTAGCTTACCGCGTTTTGGAAAGCGCCATACCGTACTATCGCGGGCCTTCTCTCA
  [Cut-site labels below the sequence, column layout lost: TaqI, Hin6I, AciI, BssKI, Bsc4I, HhaI, BsiSI, Bsu6I, Hin6I, HhaI]

  # Enzymes that cut  Frequency  Isoschizomers
  AciI                1
  Bsc4I               1
  BsiSI               1
  BssKI               1
  Bsu6I               1
  HhaI                2
  Hin6I               2          HinP1I,HspAI
  TaqI                1

  # Enzymes that do not cut
  AclI BamHI BceAI Bse1I BshI ClaI EcoRI EcoRII Hin4I HindII HindIII HpyCH4IV KpnI NotI

Example: cirdna
  File: ../../data/data.cirp

    Start 1001
    End 4270
    group
    label
    Block 1011 1362 3
    ex1
    endlabel
    label
    Tick 1610 8
    EcoR1
    endlabel
    label
    Block 1647 1815 1
    endlabel
    label
    Tick 2459 8
    BamH1
    endlabel
    label
    Block 4139 4258 3
    ex2
    endlabel
    endgroup
    group
    label
    Range 2541 2812 [ ] 5
    Alu
    endlabel
    label
    Range 3322 3497 > < 5
    MER13
    endlabel
    endgroup

Example: plotorf

EMBOSS format input/output
  Features: UFO (Universal Feature Object)
    gff, swissprot, embl, pir, nbrf
  Alignments
    Multiple and pairwise, many flavors (FASTA, MSF, SRS…)
  Reports (with or without sequence)
    Feature (UFO), SRS, motif, seqtable, excel, diffseq, listfile (USA), etc…
  Sequences (compatible with USA)
    Many!!!
    E.g., fasta, clustal, gcg, paup, gff, embl, swissprot, acedb, abi, etc…

Web interfaces
  PISE (Pasteur Institute Software Environment)
    http://www-alt.pasteur.fr/~letondal/Pise/
  wEMBOSS (Belgium & Argentina) (not yet at SIB)
    http://www.wemboss.org

Pise
  http://emboss.ch.embnet.org/Pise
  A tool to generate Web interfaces for Molecular Biology programs

wEMBOSS
  http://www.wemboss.org

Launch Jemboss
  http://emboss.ch.embnet.org/Jemboss
  First time only…
  Each time…

Jemboss windows

Jemboss windows (other systems)

Summary
  Anonymous web access through Pise
  Registered access through Jemboss
  Registered access through command-line (requires UNIX skills)
  Please report problems!

Exercises
DEA Exercises: web-based sequence analysis
The goal of this exercise is to use web-based tools for protein sequence analysis.

  a) Take this TrEMBL sequence (Q9X252) and run a BLAST against swissprot with the complete protein, then with only the first 70 residues. Explain the difference. Use TMPred, SignalP, and COILS to help you.
  b) Pass this sequence through PFSCAN and search all databases. Compare with this command on ludwig-sun1/2:
       hits -b "prf pat pfam" tr:Q9X252
  c) Use the different profile, motif, and pattern databases to get more information about the domain(s) you found.
  d) How do you evaluate the PRINTS tropomyosin annotation in this TrEMBL entry (Q9WZH0)?

List of useful links:
  basic BLAST or advanced BLAST or PSI-BLAST
  TMPred: prediction tool for transmembrane regions (or TMHMM)
  COILS: prediction tool for coiled-coil regions
  SignalP: prediction tool for signal-peptide cleavage site
  Profile, domain, motifs databases and search sites:
    PFSCAN
    InterPro (Pfam, PRINTS, PROSITE, SMART)
    HITS
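
The USA notation from the "Some details" slide can be illustrated with a small parser. This is a sketch covering only the `Format :: Database : Entry [start : end : reverse]` form, tested on the slide's own example; the class `ParsedUSA`, the function `parse_usa`, and the assumption that the reverse flag is written as a literal `r` are choices made for this example, not EMBOSS code.

```python
import re
from typing import NamedTuple, Optional


class ParsedUSA(NamedTuple):
    fmt: Optional[str]      # sequence format, e.g. "fasta"; None if omitted
    database: str           # database (or file) name
    entry: str              # entry identifier
    start: Optional[int]    # first residue of the sub-range, if given
    end: Optional[int]      # last residue of the sub-range, if given
    reverse: bool           # True when the range carries the reverse flag


# Only the "Format::Database:Entry[start:end:r]" form is recognised here.
_USA_PATTERN = re.compile(
    r"(?:(?P<fmt>\w+)::)?"
    r"(?P<database>\w+):(?P<entry>\w+)"
    r"(?:\[(?P<start>-?\d+):(?P<end>-?\d+)(?::(?P<reverse>r))?\])?"
)


def parse_usa(usa: str) -> ParsedUSA:
    """Split a database-style USA into its components."""
    match = _USA_PATTERN.fullmatch(usa.strip())
    if match is None:
        raise ValueError(f"unsupported USA form: {usa!r}")
    start, end = match.group("start"), match.group("end")
    return ParsedUSA(
        fmt=match.group("fmt"),
        database=match.group("database"),
        entry=match.group("entry"),
        start=int(start) if start is not None else None,
        end=int(end) if end is not None else None,
        reverse=match.group("reverse") is not None,
    )


parsed = parse_usa("fasta::Swissprot:UBP5_HUMAN[200:300]")
print(parsed)
```

The same function accepts the shorter form used in the exercises, for example `parse_usa("tr:Q9X252")`, where the format and the range are both omitted.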