Part I: Identifying sequences with …
Speaker: S. Gaj
Date: 11-01-2005

Annotation

Background
• Best possible description available for a given sequence at the current time.

How to annotate?
• Combining:
  – Alignment tools
  – Databases
  – Datamining (scripts)

Microarrays

Introduction

Background
Global alignment
• Optimal alignment between two sequences that covers as many characters of the query as possible.
  Ex: predicting evolutionary relationships between genes, …
Local alignment
• Optimal alignment between two sequences that identifies identical area(s).
  Ex: identifying key molecular structures (S-bonds, α-helices, …)

Basic Local Alignment Search Tool (BLAST)
• Aligning an unknown sequence (query) against all sequences present in a chosen database, based on a score value.
• Aim: obtaining structural or functional information on the unknown sequence.

BLAST

Programs
• Different BLAST programs available (rows: query type; columns: database type):

             Nucleic   Protein
  Nucleic    BlastN    BlastX
  Protein    -         BlastP

• Usable criteria: E-Value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), …

Terms
• Query: sequence which will be aligned
• Subject: sequence present in the database
• Hit: alignment result

Common BLAST problems
• BlastN: sequencing error

  Clone seq  CGATACGCCAGG-ATATACC
             |||||||||||| |||||||
  mRNA       CGATACGCCAGGGATATACC

• Solution: Low penalty for GOP and GEP = 1

Translation Problems
• 6-Frame translation

  >embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
  +3
  +2    *  H  S  D  L  A  V  N  M  K  A  L  I  V  L  G
  +1   L  A  L  *  P  S  S  Q  H  E  G  S  H  C  S  G  A
      ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
  -1
  -2
  -3

Common BLAST problems
[Figure: Gene X (introns and exons) → transcription → full mRNA → splicing → mRNA]

Common BLAST problems
[Figure: mRNA with a coding region and a non-coding region, and clones derived from the mRNA]
• BlastX of the clones against the protein sequence gives 3 possible hit situations:
  – Yields no protein hit
  – Aligns with the protein in 1 of the 6 frames
  – or: part perfect alignment

Part II: Databases and annotation

Introduction
Primary database:
– DNA sequence (EMBL, GenBank, …)
– Amino acid sequence (SwissProt, PIR, …)
– Protein structure (PDB, …)
Secondary database:
– Derived from primary DB
– DNA sequence (UniGene, RefSeq, …)
– Combination of all (LocusLink, ENSEMBL, …)
Structure:
– Flat file databases

Primary Databases
EMBL:
– DNA sequence
– Human: 4.126.190.851 nucleotides in 292.205 entries
– Clones, mRNA, (Riken) cDNA, …
– New sequences can be submitted by anyone.
– No curation check before admission.

Primary Databases
SwissProt:
– Amino acid sequence
– Human:
– Contains protein information
– SwissProt (EU)
– PIR (USA)
– Crosslinks to most informative DB (PDB, OMIM)
– Part of UniProt consortium.
– Each addition needs validation by appointed curators.
– Highly curated

Secondary Databases
TrEMBL:
– Translated EMBL
– Hypothetical proteins
– After careful assessment: SpTrEMBL → SwissProt

Secondary Databases
UniGene:
– Automated clustering of sequences with high similarity
– Derived from GenBank / EMBL
– 1 consensus sequence
– Species-specific

Secondary Databases
LocusLink:
– Curated sequences
– Descriptive information about genetic loci
RefSeq:
– Non-redundant set of sequences.
– Genomic DNA, mRNA, protein
– Stable reference for gene identification and characterization.
– High curation

Database Quality?
[Figure: DNA → mRNA → Protein]
– EMBL: Submitter → Database Manager
– SwissProt: Submitter → Curators → Database Manager

How to Annotate?
BlastN against random nucleotide DB
– ESTs
BlastN against structured nucleotide DB (UniGene, RefSeq)
– mRNA hits
– Sometimes not annotated at all
– Best information

Part III: Annotation Techniques

Annotation

What do we have?
– Probe sequence
– Alignment tools (e.g. BLAST)
– Databases
!?! What to choose ?!?

Possibilities?
1. Do it like everyone else does.
2. Make use of the curation of certain databases.
Goal: annotate as many genes with as much information as possible (e.g. SwissProt ID)

Annotation Techniques

1st Approach - General
“Done by most array manufacturers”
Step-by-step approach:
– BLAST sequences against a nucleic database (preferably UniGene)
– Extract high quality (HQ) hits (>95%)
– For each HQ hit, search crosslinks.
– Find a well-described (SwissProt) ID for each sequence.

1st Approach - Concept
[Figure: concept of the 1st approach]

2nd Approach - General
“Make use of present database curation”
Other way around:
– Use SwissProt to clean out EMBL
– Result: “cleaned” EMBL database (cEMBL) with direct SP crosslinks
– BLAST against cEMBL
– Extract high quality alignment hits (>95%)
– Convert EMBL ID to SP ID.
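
The last two steps of the 2nd approach (extract high quality hits, then convert the EMBL ID to an SP ID) could be sketched roughly as below. This is only an illustration under assumptions, not the pipeline used for the slides: it assumes BLAST results in the standard 12-column tabular layout (query, subject, percent identity, ..., bit score), it reads ">95%" as percent identity (the slides do not say which measure is meant), and the function names, the sample rows, and all IDs such as `EMBL_A` or `SP_0001` are made up.

```python
# Sketch only: function names, sample rows, and all IDs are hypothetical.
# Assumed input: BLAST tabular output with the 12 standard columns
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

def best_hq_hits(blast_tabular, min_identity=95.0):
    """Keep, per query, the highest-scoring hit with identity above the cutoff."""
    best = {}
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if pident <= min_identity:
            continue  # not a high quality hit (assumption: ">95%" means identity)
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    return best

def annotate(best_hits, embl_to_sp):
    """Convert each best EMBL hit to an SP ID; None when no crosslink is known."""
    return {query: embl_to_sp.get(hit[0]) for query, hit in best_hits.items()}

# Made-up example data: three reporters, four hits.
_rows = [
    ("reporter_1", "EMBL_A", 98.5, 200, 3, 0, 1, 200, 10, 209, 1e-100, 350.0),
    ("reporter_1", "EMBL_B", 99.0, 80, 1, 0, 1, 80, 5, 84, 1e-30, 120.0),
    ("reporter_2", "EMBL_C", 91.0, 190, 17, 0, 1, 190, 1, 190, 1e-80, 300.0),
    ("reporter_3", "EMBL_D", 96.2, 150, 6, 0, 1, 150, 20, 169, 1e-60, 210.0),
]
SAMPLE_HITS = "\n".join("\t".join(str(v) for v in row) for row in _rows)

# Made-up crosslink table from the "cleaned" EMBL entries to SP IDs.
EMBL_TO_SP = {"EMBL_A": "SP_0001", "EMBL_B": "SP_0002", "EMBL_C": "SP_0003"}

annotations = annotate(best_hq_hits(SAMPLE_HITS), EMBL_TO_SP)
print(annotations)
```

In this toy run, `reporter_2` is dropped because its only hit is below the cutoff, and `reporter_3` has a high quality hit but no SP crosslink, which corresponds to a reporter that stays without an SP ID.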
2nd Approach - Concept
[Figure: concept of the 2nd approach]

Annotating Incyte Reporters
Results
Total: 13.497
cEMBL-approach: 2.898 (21,47%) SP-IDs
DM approach: 10.013 (74,18%) UG-IDs, in which:
  M = 4.723 (34,9%) SP-IDs
  MR = 5.147 (38,1%) SP-IDs
  MRH = 6.641 (49,2%) SP-IDs

Annotating Incyte Reporters
All reporters present on “Incyte Mouse UniGene 1” converted
Results
Total: 9.596 reporters
Old annotation: 9.370 (97,6%) UG-IDs, in which:
  Non-existing UG-IDs = 5.713 (59,5%)
  M = 1.939 (20,2%) SP-IDs
  MR = 2.096 (21,8%) SP-IDs
  MRH = 2.582 (26,9%) SP-IDs
Datamining approach: 8.532 (88,9%) UG-IDs, in which:
  M = 4.145 (43,2%) SP-IDs
  MR = 4.499 (38,1%) SP-IDs
  MRH = 5.576 (60,1%) SP-IDs
Custom EMBL-approach: 2.898 (30,2%) SP-IDs

Annotating Incyte Reporters
Combined methods, “Incyte Mouse UniGene 1” reporters
Results
Total: 9.596 reporters
No annotation: 1.062 (11%) reporters
Annotated with SP-ID: 5.895 (61,3%) reporters, of which:
  2.184 (22,7%) identical SP-IDs
  532 (5%) reporters with improved SP-IDs by EMBL-method
  174 (1,8%) reporters with different mouse SP-IDs
  5 reporters found only by EMBL-method

Conclusions
• Annotation is much needed
• Direct translation into protein not best option:
  – Sequencing errors
  – Addition or deletion of nucleotides
  – 6-Frame window
• Array sequences can point to different genes
• Public nucleotide databases are redundant:
  – Sequencing errors
  – Differences in sequence-length
  – Attachment of vector-sequence

End
Questions?