Automated Function Prediction

Outline
• Goal
• How function is defined
• Why Gene Ontology
• Methods for protein function prediction
• End points

GOAL
• A) You find a new protein
• B) You sequence the whole genome of your favorite organism
• The obtained gene(s) should be annotated
• A can be solved manually. B needs automatic tools

How function is defined
• Functional description as text
• Linking gene to keywords (UniProt)
• Linking gene to Gene Ontology
• Linking gene to signalling pathways or biochemical pathways (KEGG)

Why Gene Ontology (GO)
• GO is currently a popular standard in gene annotation
• GO consists of categories that represent gene function
• Creates a union of genes in the same process
• Easy summary for genes with similar function

Why Gene Ontology (GO)
• 3 sub-parts: Biological Process, Molecular Function, Cellular Component (localization)
  – Molecular Function => chemical activity
  – Biological Process => biology, cellular process
  – Cellular Component => location of the gene product
• Hierarchical structure
  – Categories with very precise function
  – Categories with less precise function
  – Categories with very broad function

How GO helps
• End user: summary categories for genes with various functions
• Computer programs: classifier algorithms can be taught to predict the categories for genes

Understanding GO
• AmiGO server (http://amigo.geneontology.org/cgibin/amigo/go.cgi)

Function Prediction: What can we use to predict function
• Sequence homology (BLAST result list)
• Phylogenetic tree of sequences
• Protein domains (PFAM domains)
• Short sequence patterns (motifs)
• Sequence features (secondary structure, low-complexity regions)

Sequence Homology Methods
• Do a BLAST search with a query sequence
• Collect GO classes for the genes in the BLAST result list
• Give a weight to each BLAST hit, often derived from log(E-value)
• Combine the scores from the genes that belong to the same GO class
• Report the top-scoring / significant GO classes

Sequence Homology Methods
• Simple methods
• Programs
  – BLAST2GO (http://www.blast2go.com/b2ghome)
  – GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php)
  – ARGOT (http://www.medcomp.medicina.unipd.it/Argot2/form.php)
  – PFP (http://kiharalab.org/web/pfp.php)

Phylogenetic tree methods
• Create the pair-wise distances for the set of genes
• Do a hierarchical clustering of the genes
• Map the known GO functions to the cluster tree
• Look for unknown genes in a cluster with many genes from the same GO class
• Report the top-scoring / significant GO classes
• More => http://genome.cshlp.org/content/8/3/163.full

Phylogenetic tree methods
• These should outperform sequence homology methods (CAFA 2011?)
• Require a set of related genes
• Often much heavier calculations
• Programs:
  – SIFTER (http://genome.cshlp.org/content/early/2011/07/22/gr.104687.109)

Prediction with Protein domains
• Look at which protein domains the query protein contains (PFAM)
• Map the functions that are linked to the domains to your query sequence
  – PFAM2GO
• Programs: InterProScan + PFAM2GO
• Drawbacks:
  – The mapping is the same in plant, mammal, bacteria
  – Many domains to specific function

Prediction with Protein domains
• Benefits:
  – Can create an annotation from separate domains
  – Similar sequences do not have to be in the database
• Programs (?): InterProScan (http://www.ebi.ac.uk/InterProScan/)

Prediction with patterns and motifs
• Same principle as before, but we look at sequence patterns and motifs
• Map the functions that are linked to the patterns to your query sequence
• Programs:
  – InterProScan
  – IBM BioDictionary (http://cbcsrv.watson.ibm.com/Tpa.html)
• Drawbacks and benefits approximately the same as before

Prediction with sequence features
• Again the same principle as before
• We look at sequence features (see picture)
• These are given as input to a classifier algorithm (Support Vector Machine)

Prediction with sequence features
• Benefits:
  – No actual sequence similarity needed
  – Information collected from vague similarities
  – Use of a classifier => feature weighting
• Program: FFPred (http://bioinf.cs.ucl.ac.uk/ffpred/)
• Drawbacks:
  – Calculations probably quite heavy
  – No use of nearby sequence similarities (domains etc.)

Our contribution: PANNZER
• Use the BLAST result list
• Add taxonomic information
• Score GO classes using a score that takes the frequency of the GO class in the sequence database into account
• Method is used to predict:
  – GO classes
  – Description line

Our contribution: PANNZER
• Benefits:
  – Taking the species taxonomy into account
  – Improved use of statistics
• Not public yet

Our contribution: No Name Yet
• Take PFAM domain predictions, BLAST similarities and taxonomic information
• Feed these to feature selection and to a classifier algorithm
• …Wait…
• Method is used to predict GO classes
• Not public + testing is ongoing

Conclusion
• These methods are increasingly needed
• Some methods exist
• Unfortunately no clear evaluation (my opinion)
• Remember: these are predictions. No certain information until they are tested in the wet lab…
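
Example: scoring GO classes from a BLAST result list

The steps listed under "Sequence Homology Methods" (collect GO classes from the hits, weight each hit, combine the weights per GO class, report the top classes) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the function name `score_go_classes`, the choice of -log10(E-value) as the hit weight, the floor on very small E-values, and the toy hits and annotations. It is not the scoring used by BLAST2GO, GOTCHA, ARGOT, PFP or PANNZER.

```python
import math
from collections import defaultdict


def score_go_classes(hits, annotations, min_evalue=1e-180):
    """Rank GO classes by the summed weights of the BLAST hits annotated with them.

    hits:        list of (hit_id, evalue) pairs from a BLAST result list
    annotations: dict mapping hit_id -> set of GO class identifiers
    min_evalue:  assumed floor so that an E-value of 0 does not break the log
    """
    scores = defaultdict(float)
    for hit_id, evalue in hits:
        # Assumed weight: -log10(E-value), so stronger hits weigh more.
        weight = -math.log10(max(evalue, min_evalue))
        if weight <= 0:
            continue  # E-value >= 1 carries no evidence in this sketch
        for go_class in annotations.get(hit_id, ()):
            scores[go_class] += weight
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
hits = [("P1", 1e-50), ("P2", 1e-20), ("P3", 1e-5), ("P4", 5.0)]
annotations = {
    "P1": {"GO:0004672", "GO:0005524"},
    "P2": {"GO:0004672"},
    "P3": {"GO:0005524"},
    "P4": {"GO:0003677"},
}

ranked = score_go_classes(hits, annotations)
for go_class, score in ranked:
    print(f"{go_class}\t{score:.1f}")
```

Summing the weights is only one way to combine hits. A real tool would also need a significance threshold, and could propagate scores to parent classes in the GO hierarchy or correct for how common each GO class is in the database.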