Protein sequence databases
Petri Törönen

Shamelessly copied from material by Eija Korpelainen.
This also includes old material from my thesis
www.hytti.uku.fi/~toronen/Gradu_verkkoon.zip
and from the CSC bio-opas
http://www.csc.fi/oppaat/bio/
http://www.csc.fi/oppaat/bio/bio-opas.pdf

Why protein sequences?
• Most (laboratory) analysis is done with nucleotide sequences
• Therefore analysis at the nucleotide level is natural
But there are drawbacks:
- degeneracy of the genetic code: several codons encode the same amino acid => same protein, different nucleotide sequence!
  http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/C/Codons.html
- similarity between different amino acids
Therefore not all of the similarity is visible at the nucleotide level!

…more…
Protein databases also often include more detailed information.
The protein (not the RNA) is often the actual functional unit that carries out the biological function.
- note the exceptions, such as structural RNAs.

Protein databases
• SwissProt
• TrEMBL
• PIR-PSD
SwissProt and TrEMBL (Translated EMBL) have been unified into UniProt.
Note: this is only partly accurate. SwissProt is still also available as a separate entity.

Differences between databases
• Some include all the available information (more or less reliable)
  – large coverage, everything is stored in the database
  – low reliability, the information has not been confirmed
  – computer annotation => fast updating
• Some cover only the reliable information
  – small coverage
  – information is reliable
  – expert curation => slow updating
• SwissProt – TrEMBL – RemTrEMBL

Why is SwissProt nice?
• Sequences are manually annotated and checked
• No multiple entries for the same sequence
• Annotations include protein function, post-translational modifications, active sites etc.
• Linked to many other databases

So how to search protein sequences from available databases?
• Search with a protein name
• Search with a protein's function / descriptive words
• Search with a protein/RNA sequence
The next slides cover the first two options…

Ways to access Swiss/UniProt
http://au.expasy.org/sprot/
  ExPASy server for UniProt. Note that the page includes links to 'full text search' and to 'advanced search'.
http://www.ebi.uniprot.org/uniprot-srv/uniProtPowerSearch.do
  Power Search of the UniProt database
http://srs.csc.fi/
  One of the SRS servers available on the WWW
http://srs.ebi.ac.uk
http://srs.embl-heidelberg.de:8000/srs5/

SRS
• Sequence Retrieval System
• Allows searching several databases
• Not limited to SwissProt!
• AND, OR, BUTNOT type boolean operations can be used in the search (useful with keywords)
  => Works with a sequence name and with complex keyword queries.
• The results can be processed further:
  – linking to a new set of databases
  – sequence analysis, sequence alignment

Select 'start a temporary project'.

Select database(s). Here I select SwissProt.
Note that other databases can also be searched with SRS! The available databases vary between the different SRS servers.

These are the available fields that can be searched with the search term.
Insert the query for finding the sequence. Here I search with the sequence name (csk_mouse). The search goes through all the text fields (AllText) in the SwissProt files.

The obtained result. More information from here.

Available information on the sequence.
• The result demonstrates the detailed information available from SwissProt
• Note that the stored information includes
  – information on the organism
  – gene name, gene description
  – links to the articles discussing the sequence
  – the comments part, with a detailed description of
    • function
    • tissue localization
  – the features part, with a detailed description of
    • domains
    • various functional components

SRS search with boolean operators (AND, OR, BUTNOT)
Queries can be combined with & (= AND), | (= OR), !
(= NOT).
Different rows are also combined (by default) with AND.
The example looks for proteins whose organism name is either mouse OR rat. In addition, the description field must include the words receptor AND kinase BUTNOT tyrosine.

Further linking to other databases
Go to the results of the previous search.
We can link the obtained results to other databases by going further from this link.

Selection of sequences that have a known 3D structure
1. The subfolder with protein databases is opened by selecting 'protein function structure and interactions databases'
2. The box next to the PDB database is selected with the mouse
3. Let's filter the obtained results down to the ones that have a link to a 3D structure

Summary
• Protein databases show detailed information on protein sequences
• UniProt/SwissProt is the recommended protein database
  - manually curated
  - non-overlapping
• SRS is a method for searching information from selected databases with search terms
• Word of warning: sometimes SRS does not work as nicely as hoped!

Search of the protein databases with sequences
So what can be done if we have a sequence that we know nothing about?
We can look for similar known proteins in the databases.
This can be done directly with protein sequences.
(Database searching will probably be covered in more detail later. Sorry for the wrong order!)

Nucleotide to amino acids
If you have produced a nucleotide sequence in the laboratory, you might still want to compare it to protein sequences for the reasons given earlier (slide 3).
You have two options:
1. Use tools (like BLASTX, FastX) that automatically compare the nucleotide sequence to amino acid databases. These can follow sequence similarities that move from one reading frame to another.
   => Simple, you don't have to worry about translating the sequence (see below)
   BLASTX and FastX are explained in more detail later
2. Translate the sequence
using available tools (for example http://www.ebi.ac.uk/emboss/transeq/ )
   - required with tools that accept only protein sequences
   - remember that you do not know the reading frame! The correct reading frame can shift from one frame to another within the sequence (because of sequencing errors such as inserted or deleted nucleotides)!

Automatic tools comparing a nucleotide sequence with a protein database
• BLASTX
  - looks for the protein sequences most similar to your nucleotide sequence by comparing all possible reading frames
  - member of the BLAST program family
http://www.ncbi.nlm.nih.gov/BLAST/

If you do a query with a protein sequence, use this.
For nucleotide sequences, BLASTX can be found here.

SEQUENCE: >embl|AB029485|AB029485 Mus musculus ARIP1 mRNA for activin receptor interacting protein
The protein database (SwissProt) can be selected here.
You can find the sequence with Google using AB029485.
The next window is opened here.

The web page that is shown while waiting for the results.

The colour figure shows where in our query sequence the matches to the database were found. The colour indicates how good the score is.
The E value tells how many matches with a score at least this good would be expected by chance.
The alignment can be viewed from this link.
This is the link to the database that we searched, giving the full information on the sequence.
The alignment enables manual evaluation of the result.

Changing the nucleotides to amino acids
http://www.ebi.ac.uk/emboss/transeq/
Transeq requires you to paste the nucleotide sequence, to select the reading frame (1, 2 or 3) and to select the forward or reverse direction.
An example sequence obtained from randomly typed g, a, c, t:
DQLTCQSTVSAGLAWLAG MA
The sequences obtained from the different reading frames can be used to search protein databases...
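
What a translation tool does with the reading frames can be sketched in a few lines of Python. This is only a minimal illustration, not how Transeq itself is implemented: it assumes the standard genetic code, the function names (`translate`, `reverse_complement`, `six_frame_translation`) are made up for this example, and the labels for the reverse frames (-1, -2, -3 counted from the start of the reverse complement) are an assumption that may differ from the labels a given tool uses.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA (or RNA) sequence."""
    seq = seq.upper().replace("U", "T")
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, frame=1):
    """Translate seq starting at reading frame 1, 2 or 3.

    Stop codons become '*'; codons with unknown letters become 'X'.
    A trailing incomplete codon is ignored.
    """
    seq = seq.upper().replace("U", "T")
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Translate all three forward and all three reverse reading frames."""
    rc = reverse_complement(seq)
    frames = {}
    for frame in (1, 2, 3):
        frames[f"+{frame}"] = translate(seq, frame)
        frames[f"-{frame}"] = translate(rc, frame)
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
    # Two different nucleotide sequences, same protein (codon degeneracy):
    print(translate("GCTGCC") == translate("GCAGCG"))
```

Each of the six translations can then be used as a separate protein query. The last line also shows the drawback mentioned at the start: two different nucleotide sequences can encode exactly the same protein.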

Motif databases
• Motifs are conserved areas in functionally similar proteins
• These are crucial parts for protein function
  – the protein cannot change them without changing the function
• Analysis of sequences with motifs can be more efficient when no close sequence relatives are found
  – recommended when a normal sequence search gives no results

What is a motif?
Areas with strong conservation between aligned sequences
modified from Terri Attwood, 2002
modified from Eija Korpelainen...

Motif databases
BLOCKS http://blocks.fhcrc.org/
PROSITE http://au.expasy.org/prosite/
...and more... http://au.expasy.org/tools/
  The subgroup 'Pattern and profile searches' shows the list of protein motif analysis tools
INTERPRO http://www.ebi.ac.uk/InterProScan/
  Combines many motif databases in one search. Can take a DNA or protein sequence.

Fragment of the BLASTX test sequence
WW domains: important for binding proteins
PDZ domains: important for protein interactions
Kinase associated motifs
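
To make the idea of a motif search concrete, here is a small Python sketch that converts a pattern written in PROSITE-style syntax into a regular expression and scans a protein sequence with it. It is a simplified illustration only: the function names are invented for this example, the pattern and sequences in the usage part are toy data and not real PROSITE entries, and real motif tools (profiles, InterProScan) do considerably more than pattern matching. The syntax handled is: `x` for any residue, `[ST]` for allowed residues, `{P}` for forbidden residues, `(n)` or `(n,m)` for repeats, `-` between elements, and `<` / `>` for anchoring to the N or C terminus.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string into a regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        match = ELEMENT.fullmatch(element.strip("<>"))
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            part = "."                        # any residue
        elif core.startswith("{"):
            part = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            part = core                       # single residue or [set]
        if low:
            part += "{" + low + ("," + high if high else "") + "}"
        if anchor_start:
            part = "^" + part
        if anchor_end:
            part += "$"
        parts.append(part)
    return "".join(parts)


def find_motifs(sequence, pattern):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_pattern = "[ST]-x(2)-{P}-[DE]"   # hypothetical pattern, illustration only
    print(prosite_to_regex(toy_pattern))
    print(find_motifs("AASGGADKK", toy_pattern))
```

Note that `re.finditer` reports only non-overlapping matches, so overlapping occurrences of a motif would be missed by this sketch.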