Detection of alpha-catenin in rupture

Part I:
Identifying sequences with …
Speaker: S. Gaj
Date: 11-01-2005
Annotation
Background
• Best possible description available for a given sequence at the current time.
How to annotate?
• Combining:
  – Alignment Tools
  – Databases
  – Datamining (scripts)
Microarrays
Introduction
Global alignment
• Optimal alignment between two sequences containing as many characters of the query as possible.
Ex: predicting evolutionary relationships between genes, …
Local alignment
• Optimal alignment between two sequences identifying identical area(s).
Ex: identifying key molecular structures (S-bonds, α-helices, …)
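The difference between the two alignment types can be illustrated with the textbook dynamic-programming recurrences (Needleman-Wunsch style for global, Smith-Waterman style for local). This is only a minimal sketch: the function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, and BLAST itself uses a faster heuristic rather than this exhaustive computation.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-1):
    """Return (global_score, local_score) for two sequences.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    glob = [[0] * cols for _ in range(rows)]
    loc = [[0] * cols for _ in range(rows)]

    # Global alignment: leading gaps are penalised.
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            # Local alignment: a cell may restart at 0, so only the
            # best-matching region contributes to the score.
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return glob[-1][-1], best_local


print(alignment_scores("GGGACGT", "ACGT"))
```

With these assumed scores the global alignment has to pay for the three unmatched leading bases, while the local alignment only reports the shared ACGT region.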
Introduction
BLAST
Basic Local Alignment Search Tool
• Aligning an unknown sequence (query) against all sequences present in a chosen database based on a score-value.
Aim:
• Obtaining structural or functional information on the unknown sequence.
Programs
• Different BLAST programs available:

  Query \ Database   Nucleic   Protein
  Nucleic            BlastN    BlastX
  Protein            -         BlastP

• Usable criteria: E-Value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), …
Terms
• Query: Sequence which will be aligned
• Subject: Sequence present in database
• Hit: Alignment result
Common BLAST problems
• BlastN: Sequencing Error

  Clone seq   CGATACGCCAGG-ATATACC
              |||||||||||| |||||||
  mRNA        CGATACGCCAGGGATATACC

• Solution: Low penalty for GOP and GEP = 1
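To see why low gap penalties help with a single-base sequencing error, the alignment above can be scored under two penalty settings. This is a rough sketch: the function name, the match/mismatch values and the convention that a gap of length k costs GOP + k × GEP are assumptions made for illustration; BLAST implementations may define these differently.

```python
def score_alignment(top, bottom, match=1, mismatch=-3, gop=5, gep=2):
    """Score an already-gapped pairwise alignment with affine gap penalties.

    Assumed convention: a gap run of length k costs gop + k * gep.
    Adjacent gap columns are treated as one run, whichever sequence has the gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # Opening a gap costs gop + gep, extending it only gep.
            score -= gep if in_gap else gop + gep
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


clone = "CGATACGCCAGG-ATATACC"
mrna = "CGATACGCCAGGGATATACC"
print(score_alignment(clone, mrna))                # stricter gap penalties
print(score_alignment(clone, mrna, gop=1, gep=1))  # low gap penalties
```

With the low penalties the missing base costs far less, so the hit keeps a score close to that of a perfect match.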
Translation Problems
• 6-Frame translation
>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
+3
+2    *  H  S  D  L  A  V  N  M  K  A  L  I  V  L  G
+1   L  A  L  *  P  S  S  Q  H  E  G  S  H  C  S  G  A
    ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
-1
-2
-3
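A six-frame translation can be sketched in a few lines using the standard genetic code. The code below is a minimal illustration, not the translation routine used by BlastX; the function names and the frame-numbering convention for the reverse strand are assumptions. Only the sequence prefix shown on the slide is used, since the rest of the record is elided there.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3.

    Assumed convention: frame -n starts at offset n-1 of the reverse complement.
    """
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# Prefix of the lysozyme mRNA as shown on the slide (the remainder is elided there).
prefix = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct"
for name, protein in sorted(six_frames(prefix).items()):
    print(name, protein)
```

For this prefix, frame +1 hits a stop codon after three residues, while frame +2 contains the ATG start codon, which shows why all six frames have to be checked.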
Common BLAST problems
[Diagram: Gene X (exons and introns) → transcription → full mRNA → splicing → mRNA]
Common BLAST problems
[Diagram: mRNA with coding and non-coding regions; clones derived from mRNA]
BlastX against protein sequence: 3 possible hit-situations
→ Yields no protein hit
→ Aligns with protein in 1 of the 6 frames
or
→ Part perfect alignment
Part II:
Databases and annotation
Introduction
Primary database:
– DNA Sequence (EMBL, GenBank, …)
– AminoAcid Sequence (SwissProt, PIR, …)
– Protein Structure (PDB, …)
Secondary database:
– Derived from primary DB
– DNA Sequence (UniGene, RefSeq, …)
– Combination of all (LocusLink, ENSEMBL, …)
Structure:
– Flat file databases
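Flat file databases such as EMBL store each entry as plain text lines that start with a two-letter line code (for example AC for accession, DE for description), with `//` closing an entry. A minimal reader could look like the sketch below; the function name is hypothetical and the sample record is a heavily abridged illustration built from the accession and description quoted earlier, not a complete EMBL entry.

```python
def parse_flat_file(text):
    """Yield one dict per entry, mapping line code -> list of line contents."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            if record:
                yield record
            record = {}
            continue
        code, content = line[:2], line[5:].strip()
        if code.strip():               # skip sequence lines, which have no code
            record.setdefault(code, []).append(content)
    if record:                         # tolerate a missing final terminator
        yield record


# Abridged, illustrative record (not a full EMBL entry).
SAMPLE = """\
AC   J03801;
DE   Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
//
"""

for entry in parse_flat_file(SAMPLE):
    print(entry["AC"], entry["DE"])
```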
Primary Databases
EMBL:
– DNA Sequence
– Human: 4.126.190.851 nucleotides in 292.205 entries
– Clones, mRNA, (Riken) cDNA, …
– New sequences can be submitted by anyone.
– No curation check before admission.
Primary Databases
SwissProt:
– Amino Acid sequence
– Human:
– Contains protein information
– SwissProt (EU) ↔ PIR (USA)
– Crosslinks to most informative DB (PDB, OMIM)
– Part of UniProt consortium.
– Each addition needs validation by appointed curators.
– Highly curated
Secondary Databases
TrEMBL:
– Translated EMBL
– Hypothetical proteins
– After careful assessment ↔ SpTrEMBL ↔ SwissProt
Secondary Databases
UniGene:
– Automated clustering of sequences with high similarity
– Derived from GenBank / EMBL
– 1 consensus-sequence
– Species-specific
Secondary Databases
LocusLink:
– Curated sequences
– Descriptive information about genetic loci
RefSeq:
– Non-redundant set of sequences.
– Genomic DNA, mRNA, Protein
– Stable reference for gene identification and characterization.
– High curation
Database Quality?
[Diagram: DNA → mRNA → Protein. EMBL: Submitter → Database Manager. SwissProt: Submitter → Curators → Database Manager.]
How to Annotate?
→ BlastN against random nucleotide DB
  – ESTs
→ BlastN against structured nucleotide DB (UniGene, RefSeq)
  – mRNA hits
  – Sometimes not annotated at all
  – Best information
Part III:
Annotation Techniques
Annotation
What do we have?
→ Probe sequence
→ Alignment Tools (e.g. BLAST)
→ Databases
!?! What to choose ?!?
Possibilities?
1. Do it like everyone else does.
2. Make use of the curation of certain databases.
Goal:
Annotate as many genes with as much information as possible (e.g. SwissProt ID)
Annotation Techniques
1st Approach - General
→ “Done by most array manufacturers”
→ Step-by-step approach:
  – BLAST sequences against nucleic database (preferably UniGene)
  – Extract high quality (HQ) hits (>95%)
  – For each HQ hit search crosslinks.
  – Find a well-described (SwissProt) ID for each sequence.
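The step-by-step approach above can be sketched as a small filter-and-lookup routine. Everything here is an assumption made for illustration: the hit tuples (reporter, subject ID, percent identity), the crosslink dictionary and all IDs are made up, and ">95%" is read as percent identity of the hit.

```python
def annotate_via_crosslinks(hits, crosslinks, min_identity=95.0):
    """Keep the best high-quality hit per reporter and follow its crosslink.

    hits:       iterable of (reporter, subject_id, percent_identity)
    crosslinks: dict mapping subject_id -> SwissProt ID (hypothetical layout)
    Returns a dict reporter -> SwissProt ID, or None if no crosslink exists.
    """
    best = {}
    for reporter, subject, identity in hits:
        # Only hits above the threshold count as high quality (HQ).
        if identity > min_identity and identity > best.get(reporter, (None, 0.0))[1]:
            best[reporter] = (subject, identity)
    return {rep: crosslinks.get(subj) for rep, (subj, _) in best.items()}


# Made-up example data.
hits = [
    ("rep1", "UG_1", 99.0),
    ("rep1", "UG_2", 96.0),
    ("rep2", "UG_3", 90.0),   # below threshold, dropped
    ("rep3", "UG_4", 97.5),   # HQ hit without a SwissProt crosslink
]
crosslinks = {"UG_1": "SP_A"}
print(annotate_via_crosslinks(hits, crosslinks))
```

Reporters such as rep3 show the weak point of this route: a high-quality nucleotide hit does not guarantee that a well-described protein ID can be reached through crosslinks.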
Annotation Techniques
1st Approach - Concept
Annotation Techniques
2nd Approach - General
→ “Make use of present database curation”
→ Other way around:
  – Use SwissProt to clean out EMBL
  – Result: “Cleaned” EMBL database (cEMBL) with direct SP crosslinks
  – BLAST against cEMBL
  – Extract high quality alignment hits (>95%)
  – Convert EMBL ID to SP ID.
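One possible reading of this second approach is sketched below: SwissProt cross-references are inverted into an EMBL-to-SwissProt index, EMBL entries without such a crosslink are discarded, and hits against the remaining entries are converted directly. The slides do not spell out how the cleaning was done, so the data layout, function names and IDs are all assumptions.

```python
def build_cembl_index(swissprot_xrefs):
    """Invert SwissProt -> [EMBL accessions] into EMBL accession -> SwissProt ID."""
    index = {}
    for sp_id, embl_ids in swissprot_xrefs.items():
        for embl_id in embl_ids:
            index.setdefault(embl_id, sp_id)   # keep the first crosslink seen
    return index


def clean_embl(embl_records, index):
    """Keep only EMBL entries that a SwissProt entry points to (the 'cEMBL')."""
    return {acc: seq for acc, seq in embl_records.items() if acc in index}


def embl_hits_to_sp(hits, index, min_identity=95.0):
    """Convert high-quality hits (reporter, EMBL accession, identity) to SP IDs."""
    return {rep: index[acc]
            for rep, acc, identity in hits
            if identity > min_identity and acc in index}


# Made-up example data.
swissprot_xrefs = {"SP_A": ["EMBL_1", "EMBL_2"], "SP_B": ["EMBL_3"]}
embl_records = {"EMBL_1": "ACGT", "EMBL_3": "GGCC", "EMBL_9": "TTTT"}

index = build_cembl_index(swissprot_xrefs)
cembl = clean_embl(embl_records, index)
print(sorted(cembl))
print(embl_hits_to_sp([("rep1", "EMBL_1", 98.0), ("rep2", "EMBL_3", 91.0)], index))
```

Because every sequence left in the cleaned database already carries a curated SwissProt crosslink, the final conversion is a direct lookup rather than a search through several databases.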
Annotation Techniques
2nd Approach - Concept
Annotating Incyte Reporters
Results
Total: 13.497
cEMBL-approach: 2.898 (21,47%) SP-IDs
DM approach: 10.013 (74,18%) UG-IDs, of which
M = 4.723 (34,9%) SP-IDs; MR = 5.147 (38,1%) SP-IDs; MRH = 6.641 (49,2%) SP-IDs
Annotating Incyte Reporters
All reporters present on “Incyte Mouse UniGene 1” converted
Results
Total: 9.596 reporters
Old annotation: 9.370 (97,6%) UG-IDs, of which
Non-existing UG-IDs = 5.713 (59,5%); M = 1.939 (20,2%) SP-IDs; MR = 2.096 (21,8%) SP-IDs; MRH = 2.582 (26,9%) SP-IDs
Datamining approach: 8.532 (88,9%) UG-IDs, of which
M = 4.145 (43,2%) SP-IDs; MR = 4.499 (46,9%) SP-IDs; MRH = 5.576 (60,1%) SP-IDs
Custom EMBL-approach : 2.898 (30,2%) SP-IDs
Annotating Incyte Reporters
Combined methods “Incyte Mouse UniGene 1” reporters
Results
Total: 9.596 reporters
No annotation : 1.062 (11%) reporters
Annotated with SP-ID : 5.895 (61,3%) reporters of which
2.184 (22,7%) identical SP-IDs; 532 (5%) reporters with improved SP-IDs by EMBL-method;
174 (1,8%) reporters with different mouse SP-IDs; 5 reporters found only by EMBL-method
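A comparison like the one summarised above can be produced by tallying, per reporter, whether the two methods agree. The sketch below is illustrative only: the function names, category labels and example IDs are assumptions, and the "improved SP-ID" category from the slide is not modelled because the criterion for "improved" is not given.

```python
def compare_annotations(reporters, dm, embl):
    """Count per-reporter agreement between two annotation dicts (reporter -> SP ID)."""
    summary = {"identical": 0, "different": 0,
               "only_dm": 0, "only_embl": 0, "none": 0}
    for rep in reporters:
        a, b = dm.get(rep), embl.get(rep)
        if a and b:
            key = "identical" if a == b else "different"
        elif a:
            key = "only_dm"
        elif b:
            key = "only_embl"
        else:
            key = "none"
        summary[key] += 1
    return summary


def percent(count, total):
    """Share of reporters as a percentage, rounded to one decimal."""
    return round(100 * count / total, 1)


# Made-up example data.
reporters = ["r1", "r2", "r3", "r4", "r5"]
dm = {"r1": "SP_A", "r2": "SP_B", "r3": "SP_C"}
embl = {"r1": "SP_A", "r2": "SP_X", "r4": "SP_D"}
print(compare_annotations(reporters, dm, embl))
print(percent(1062, 9596))  # share of reporters without annotation
```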
Conclusions
• Annotation is much needed
  → Array sequences can point to different genes
• Direct translation into protein not best option:
  → Sequencing errors
  → Addition or deletion of nucleotides
  → 6-Frame window
• Public nucleotide databases are redundant.
  → Sequencing errors
  → Differences in sequence-length
  → Attachment of vector-sequence
End
Questions?