Protein databases
Petri Törönen
Shamelessly copied from material by Eija Korpelainen and from the CSC bio-opas
http://www.csc.fi/oppaat/bio/
http://www.csc.fi/oppaat/bio/bio-opas.pdf

Why protein sequences?
• Most (laboratory) analysis is done with nucleotide sequences
• Therefore analysis at the nucleotide level is natural
But there are drawbacks:
– Divergence in codons: several codons encode the same amino acid => same protein, different nucleotide sequence!
  http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/C/Codons.html
– Similarity between different amino acids (chemically similar residues can replace each other)
Therefore not all similarity is visible at the nucleotide level!

…more…
• Protein databases often also include more detailed information.
• The protein (not the RNA) is often the actual functional unit that carries out a biological function.
– Note the exceptions, such as structural RNAs.

Various protein (related) databases
• Databases including protein sequences
  – UniProt
• Databases including protein domains
  – Pfam
  – PROSITE
• Databases including protein sequence patterns, motifs
  – PROSITE

Differences between databases
"Size" of the included data components:
• "Large" components:
  – Whole sequences
• "Medium" components
  – Protein domains
  – http://en.wikipedia.org/wiki/Protein_domain
• "Small" components
  – Protein sequence motifs
  – http://en.wikipedia.org/wiki/Sequence_motif
• A protein sequence can include many domains, and a domain can contain many motifs

Differences between databases
• Some include all the available information (more or less reliable information)
  – large coverage, everything is stored in the database
  – low reliability, information has not been confirmed
  – computer annotation => fast updating
• Some cover only the reliable information
  – small coverage
  – information is reliable
  – expert curation => slow updating
• Swiss-Prot (curated) ↔ TrEMBL (uncurated)

Differences between databases
Why the previous division?
• Some protein features/functions are linked to domains
• Some features/functions are linked to specific sequence motifs
• Some features are best described at the whole-sequence level

Protein sequence databases
• UniProt
• Swiss-Prot + TrEMBL
• PIR-PSD
• Let's focus on Swiss-Prot

Why is Swiss-Prot nice?
• Sequences are manually annotated and checked
• No multiple entries for the same sequence
• Annotations include protein function, post-translational modifications, active sites etc.
• Linked to many other databases
• Similarity to RefSeq

So how to search protein sequences from the available databases?
• Search with a protein name
• Search with a protein's function or descriptive words
• Search with a protein/RNA sequence
WWW link for the first two options…
http://www.uniprot.org/uniprot/

Searching UniProt
• Demonstrate the search by looking for human protein kinases
Choose database
Type query here
Here you limit the search to Swiss-Prot

Let's first go to Advanced Search
Select field here
Type query here
1. Select "protein name" as the field
2. Type the query: protein kinase

We get all sequences that have both words (protein AND kinase) in their description

After the previous results, open a new search row in Advanced Search
Next select "organism" as the field and type homo sapiens.
Click Add&Search

RESULTS:
Here you can look at common features among the obtained sequences
Here limit to Swiss-Prot
More info on hits by clicking the gene name
Let's open one for a better view…

Different fields of information can be found when scrolling down the page
NOTICE:
Detailed description of function → General annotation
Alternative splice variants and mutations reported → Alternative products → Natural variations

• The obtained result demonstrates the detailed information available from Swiss-Prot
• Note that the stored information includes
  – information on the organism
  – gene name, gene description
  – links to the articles discussing the sequence
  – the comment part has a detailed description of
    • function
    • tissue localization
  – the features part has a detailed description of
    • domains
    • various functional components

Extra Slide
Go back to search results
Test these
Select keyword, and open the Disease list for better viewing…

Extra Slide
You can view which genes have been reported to be involved in some diseases
Note that 18 are linked to tumor suppressors and 36 to proto-oncogenes

Summary
• Protein databases show detailed information on protein sequences
• UniProt/Swiss-Prot is the recommended protein database
  – manually curated
  – non-overlapping
• Swiss-Prot can show very detailed information on sequences

Sequence Motifs
• Motifs are conserved areas in functionally similar proteins
• These are crucial parts for protein function
  – the protein cannot change them without changing the function
• Analysis of sequences with motifs can be more efficient when no close sequence relatives are found
  – recommended when a normal sequence search gives no results
http://en.wikipedia.org/wiki/Sequence_motif

What is a motif?
Areas with strong conservation between aligned sequences
Multiple sequence alignment of sequences with similar function
modified from Terri Attwood, 2002
modified from Eija Korpelainen...
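Searching a sequence for a motif can be sketched in a few lines of Python. The sketch below is not from the original slides: it assumes the motif is written in the PROSITE pattern notation (elements separated by "-", "x" for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats) and translates it into a regular expression. The helper names prosite_to_regex and find_motif and the toy sequence are invented for illustration; the pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # "<" anchors the pattern to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # ">" anchors the pattern to the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:                            # x(3) or x(2,4) style repetition
            element, low, high = m.group(1), m.group(2), m.group(3)
            repeat = "{%s,%s}" % (low, high) if high else "{%s}" % low
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # forbidden residues
        else:
            core = element               # single residue or [ABC] group
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern: str, sequence: str):
    """Return (1-based position, matched text) for every, possibly overlapping, hit."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    toy_sequence = "MKNGTALNPSWNQSA"   # invented sequence, for illustration only
    for position, hit in find_motif("N-{P}-[ST]-{P}.", toy_sequence):
        print(position, hit)
```

The lookahead in find_motif is there so that overlapping occurrences are also reported; a plain search would skip a hit that starts inside the previous one.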
Domain databases
• A domain is a sub-component of a protein
• It can exist and function independently of the rest of the protein sequence
• Domains often act as building blocks in evolution that are combined to form proteins
• The same domain can occur in various proteins
• http://en.wikipedia.org/wiki/Protein_domain

Domain and motif databases
• Pfam
• PROSITE
• PRINTS
• TIGRFAM
• PRODOM
… and many more
All are combined into one service → InterPro
http://www.ebi.ac.uk/interpro/
http://www.ebi.ac.uk/interpro/about.html

What is InterPro
• Collection of many protein-related databases
• All aim to report various features that can be used to analyze sequences
• Features:
  – Domains, sequence motifs, global sequence homology
• The different databases can be queried simultaneously via InterPro

What is InterPro
• This generates a large amount of information for a single query
• Good chance of getting useful information for an unknown sequence
• Some databases are well annotated
• The drawback is the repetition in the results from different databases
• Queries are also SLOW

How to use InterPro
• Sequence queries go to InterProScan
Sequence here
Let's use the Serine/threonine protein kinase N1 sequence as the query
This sequence was in the UniProt results

Results
Click titles for more info
Query name
Sequence here
Visualization of results
Domain associated with one region of the sequence
Let's check more information on the reported domains…
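The repetition between member databases noted above can be reduced by grouping hits that cover overlapping regions of the query. The following is only a minimal sketch: it assumes the hits have already been extracted into (database, start, end) tuples, and the coordinates are invented for illustration; they are not real InterProScan output, and group_overlapping is a hypothetical helper, not part of any InterPro tool.

```python
def group_overlapping(hits):
    """Group (database, start, end) hits whose sequence regions overlap."""
    groups = []
    # Process hits from left to right along the sequence.
    for database, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if groups and start <= groups[-1]["end"]:
            # Overlaps the current region: extend it and record the database.
            groups[-1]["end"] = max(groups[-1]["end"], end)
            groups[-1]["databases"].append(database)
        else:
            # No overlap: start a new region.
            groups.append({"start": start, "end": end, "databases": [database]})
    return groups


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    toy_hits = [
        ("PFAM", 50, 90),
        ("PROSITE", 55, 60),
        ("PRINTS", 5, 20),
        ("PROSITE", 52, 95),
    ]
    for region in group_overlapping(toy_hits):
        print(region["start"], region["end"], region["databases"])
```

Each resulting region lists every database that reported something there, which mirrors how one region of the sequence ends up associated with signatures from several sources.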
Results
Contributing signatures from many databases
Sequence signatures found by InterProScan usually have a detailed description

Results
• InterProScan gives us matches in the sequence to various sequence features
  – Domains, motifs
• These features are often well annotated
• Features associate functions with specific regions of the sequence

Other Databases
• Databases describing gene functions
  – Gene Ontology databases
  – Reaction pathway databases
• Databases describing associations to phenotypes
  – Disease gene databases
  – Phenotype databases

Databases describing functions
Why do we need these databases?
• The databases covered earlier are helpful when the analysis starts from a single unknown gene
• These databases help us to find all genes known to be linked to a certain task
  – Say, all apoptosis-related genes in human
• They are also helpful when we analyze large sets of genes
  – Is there something in common among the 100 genes that are most active in a cancer cell?

Databases describing functions
• Gene Ontology databases
  – Classify genes into categories that describe gene function
  – Standardized classification applicable to all species
  – Classes represent involvement in biological tasks (like protein synthesis), chemical activities (like carbohydrate binding) or localization in the cell (like nucleus)
• http://en.wikipedia.org/wiki/Gene_ontology

Databases describing functions
• Pathway databases
  – Classify genes into biochemical pathways
  – Classify genes into signalling pathways
• Example databases:
  – KEGG: www.genome.ad.jp/kegg/
  – REACTOME: http://www.reactome.org/
• http://en.wikipedia.org/wiki/Biological_pathway

www.geneontology.org
• The Gene Ontology (GO) is a hierarchical structure for categorizing gene products in terms of their association with:
  1. biological processes
  2. cellular components
  3. molecular functions
• in a species-independent manner

Structure of Gene Ontology
• Hierarchical structure of linked nodes
• Smaller classes: child classes
  – Precise, detailed information
• Larger classes: parent classes
  – Broad, unspecific information
• Smaller classes belong to larger classes
  – Viral protein biosynthesis =>
  – Protein biosynthesis =>
  – Biosynthesis
Figure labels: root of hierarchical structure; starting node

Gene Ontology databases
• AmiGO http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
• QuickGO http://www.ebi.ac.uk/QuickGO/

AmiGO
• Server maintained by the GO consortium for analysing gene annotations across species
• http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

AmiGO
Select: GO terms or gene names
Query here
This limits to exact match

AmiGO
Associated genes
We get the precise definition of the class

AmiGO
Let's have a look at genes associated with apoptosis in yeast (Saccharomyces cerevisiae)
Here you can limit the species
Selected genes could be taken to a more detailed laboratory analysis…

Databases describing functions
• These group genes into classes or pathways
• Databases can be queried to see which genes are in a certain class / pathway
• You can also check which classes a certain gene belongs to

Databases summary
• Nucleotide databases
• Genome databases
• Protein databases
• Protein motif / domain databases
• Function related databases

WAKE UP!
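The two kinds of query on function databases (which genes are in a class, and which classes a gene belongs to) both follow from the parent/child structure of Gene Ontology. A minimal sketch, assuming a toy hierarchy built from the example chain in the slides; the gene names geneA, geneB and geneC and the helper functions are hypothetical, and real GO data would be loaded from the ontology and annotation files instead.

```python
# Toy fragment of the hierarchy: each class maps to its direct parent classes.
# A class may have several parents, so the values are lists.
PARENTS = {
    "viral protein biosynthesis": ["protein biosynthesis"],
    "protein biosynthesis": ["biosynthesis"],
    "biosynthesis": [],
}

# Hypothetical annotations: each gene is annotated to its most specific class.
ANNOTATIONS = {
    "geneA": {"viral protein biosynthesis"},
    "geneB": {"protein biosynthesis"},
    "geneC": {"biosynthesis"},
}


def ancestors(term):
    """Return every larger (parent) class that a class belongs to."""
    found = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return found


def classes_of(gene):
    """Classes a gene belongs to: its direct annotations plus all their parents."""
    classes = set()
    for term in ANNOTATIONS.get(gene, ()):
        classes.add(term)
        classes |= ancestors(term)
    return classes


def genes_in(term):
    """Genes in a class, including genes annotated to its smaller child classes."""
    return sorted(gene for gene in ANNOTATIONS if term in classes_of(gene))


if __name__ == "__main__":
    print(classes_of("geneA"))
    print(genes_in("protein biosynthesis"))
```

Because smaller classes belong to larger ones, a gene annotated only to the most specific class is still returned when the broader class is queried.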