Name - Center for Genomics and Biotechnology

Functional Annotation
Gene Function Prediction
Haibao Tang (唐海宝)
Center for Genomics and Biotechnology
November 23, 2013
Functional Annotation
Name that protein
C2H2 zinc finger proteins
Calmodulin and calmodulin-related calcium sensor proteins
Cellulose Synthase Gene Family
Cysteine Rich Peptides
Cytochrome P450
Early Auxin-responsive Aux/IAA Gene Family
F-Box Proteins
Glycosyl Hydrolase
MADS-box family
Serine Proteases
WRKY family
……
Erythropoietin (promotes red blood cell production)
Myostatin (limits muscle growth)
Outline
• Basic Searches to Run
• Advanced Assignments
• Protein Families
• Naming Genes
1. Basic Searches to Run
Basic Searches to Run
• BLAST (nucleotide or protein homology)
  – Non-redundant protein sequences (nr)
  – UniRef (UniProt - Swiss-Prot, TrEMBL)
  – Trusted genomes (TAIR)
• CDD (NCBI’s Conserved Domain Database)
• InterPro (protein families, domains and functional sites)
• HMMER or SAM (searches using statistical descriptions)
  – Pfam (database of protein families and HMMs)
  – TIGRFAMs (protein family based HMMs)
  – SCOP (structural domains)
  – TMHMM (transmembrane domains)
• SignalP (signal peptide cleavage sites)
• TargetP (subcellular location)
• Many others
Web BLAST
• NCBI BLAST: http://www.ncbi.nlm.nih.gov/blast/
• WU-BLAST: http://genome.wustl.edu/tools/blast/
• UniProt/Swiss-Prot BLAST: http://www.uniprot.org/
• Phytozome: http://www.phytozome.net/search.php
• The Gene Indices: http://compbio.dfci.harvard.edu/tgi/
• Sanger projects: http://www.sanger.ac.uk/DataSearch/
• TAIR: http://www.arabidopsis.org/Blast/index.jsp
CDD
• Collection of multiple sequence alignments
• Contains protein domain models imported from outside sources, such as Pfam, SMART, COGs (Clusters of Orthologous Groups of proteins) and PRK (PRotein Klusters), as well as models curated at NCBI.
InterPro
• Database of protein families, domains and functional sites in which
identifiable features found in known proteins can be applied to
unknown protein sequences.
Hidden Markov Model
• Databases of HMM domains to search:
  – Pfam: http://www.sanger.ac.uk/Software/Pfam/
  – TIGRFAMs: http://www.jcvi.org/cms/research/projects/tigrfams/overview/
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
  – TMHMM: http://www.cbs.dtu.dk/services/TMHMM/
• Tools to use:
  – HMMER, HMMPFAM: http://hmmer.janelia.org/
Pfam
• For each family in Pfam you can:
• Look at multiple alignments
• View protein domain architectures
• Examine species distribution
• Follow links to other databases
• View known protein structures
TMHMM
• Predicts transmembrane helices in integral membrane proteins using HMMs
SignalP
• Predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms.
• Based on a combination of artificial neural networks and HMMs.
TargetP
• TargetP predicts the subcellular location of eukaryotic proteins.
• The location assignment is based on the predicted presence of any of
the N-terminal presequences:
• chloroplast transit peptide (cTP)
• mitochondrial targeting peptide (mTP)
• secretory pathway signal peptide (SP)
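The mapping from a predicted presequence to a location can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not TargetP's actual algorithm: the per-class scores, the `cutoff` value and the function name `assign_location` are all assumptions made for the example.

```python
from typing import Dict

# Presequence classes named on the slide and the location each one implies.
LOCATION_BY_PRESEQUENCE = {
    "cTP": "chloroplast",
    "mTP": "mitochondrion",
    "SP": "secretory pathway",
}


def assign_location(scores: Dict[str, float], cutoff: float = 0.5) -> str:
    """Pick the location implied by the best-scoring presequence.

    `scores` holds hypothetical prediction scores per presequence class.
    If no class reaches `cutoff` (an arbitrary illustrative threshold),
    the protein is reported as "other".
    """
    best = max(LOCATION_BY_PRESEQUENCE, key=lambda name: scores.get(name, 0.0))
    if scores.get(best, 0.0) < cutoff:
        return "other"
    return LOCATION_BY_PRESEQUENCE[best]


# Hypothetical scores for two proteins.
print(assign_location({"cTP": 0.91, "mTP": 0.06, "SP": 0.02}))
print(assign_location({"cTP": 0.10, "mTP": 0.20, "SP": 0.15}))
```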
Gene function evidence
2. Advanced Assignments
Advanced Assignments
• Enzyme Commission (EC) Number: http://www.chem.qmul.ac.uk/iubmb/enzyme/
• Gene Ontology (GO) Terms
• Pathways
  – KEGG
  – MetaCyc
  – Pathway Tools
Assigning EC Number
• The EC classification scheme is a hierarchical numerical classification based on the chemical reactions that enzymes catalyze.
• Every enzyme code consists of four numbers separated by periods. Example: EC 1.1.1.1, alcohol dehydrogenase
• EC numbers may be assigned computationally.
• There are many available tools and methods for predicting EC numbers and pathways.
• Common problems:
  – A computational method may not be specific enough to assign a full EC number to an enzyme. It is often more accurate to assign a gene to an enzyme family than to a specific enzyme, so the fourth number is often left blank (e.g., EC 1.1.1.-).
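The four-level structure and the "blank fourth number" case can be made concrete with a small parser. This is a minimal sketch; the function names `parse_ec`, `is_complete` and `specificity` are made up for the example, and a `-` is assumed to mark an unassigned level.

```python
from typing import Optional, Tuple


def parse_ec(ec: str) -> Tuple[Optional[int], ...]:
    """Parse 'EC 1.1.1.1' or '1.1.1.-' into four levels; '-' becomes None."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = [part.strip() for part in text.split(".")]
    if len(parts) != 4:
        raise ValueError(f"expected four fields separated by periods: {ec!r}")
    levels = tuple(None if part == "-" else int(part) for part in parts)
    # Once a level is blank, every deeper level must be blank too.
    seen_blank = False
    for level in levels:
        if level is None:
            seen_blank = True
        elif seen_blank:
            raise ValueError(f"assigned level after a blank level: {ec!r}")
    return levels


def is_complete(ec: str) -> bool:
    """True when all four levels are assigned (a specific enzyme)."""
    return all(level is not None for level in parse_ec(ec))


def specificity(ec: str) -> int:
    """Number of assigned levels, from 0 to 4."""
    return sum(level is not None for level in parse_ec(ec))


print(parse_ec("EC 1.1.1.1"), is_complete("EC 1.1.1.1"))
print(parse_ec("1.1.1.-"), specificity("1.1.1.-"))
```

A partial code such as `1.1.1.-` then reads as an assignment at the family level only, which is the situation described under "Common problems".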
GO Terms
• Gene Ontology (Gene Ontology Consortium™) is a method used to structure biological knowledge using a dynamic controlled vocabulary across organisms.
• Molecular function (MF)
  – What the gene product does
  – Think ‘activity’
  – Ion channel activity
• Biological process (BP)
  – A biological objective
  – Ion transport, transmembrane transport
• Cellular component (CC)
  – Location in the cell (or smaller unit)
  – Or part of a complex
  – Membrane, plasma membrane
• You can obtain GO for any sequence using tools like:
  – BLAST2GO
  – INTERPRO2GO
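As a rough illustration of how the three GO aspects organize annotations, the sketch below groups the example terms from this slide by aspect. The data layout (pairs of aspect code and term name) and the function name `group_by_aspect` are assumptions for the example; real annotation tools also report GO identifiers and evidence codes, which are left out here.

```python
from typing import Dict, Iterable, List, Tuple

# The three GO aspects and the short codes used on the slide.
ASPECTS = {
    "MF": "molecular function",
    "BP": "biological process",
    "CC": "cellular component",
}


def group_by_aspect(annotations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (aspect code, term name) pairs by aspect, dropping duplicates."""
    grouped: Dict[str, List[str]] = {code: [] for code in ASPECTS}
    for aspect, term in annotations:
        if aspect not in ASPECTS:
            raise ValueError(f"unknown GO aspect: {aspect!r}")
        if term not in grouped[aspect]:
            grouped[aspect].append(term)
    return grouped


# Example terms from the slide, as they might be attached to an ion channel.
channel_annotations = [
    ("MF", "ion channel activity"),
    ("BP", "ion transport"),
    ("BP", "transmembrane transport"),
    ("CC", "membrane"),
    ("CC", "plasma membrane"),
]

for code, terms in group_by_aspect(channel_annotations).items():
    print(f"{ASPECTS[code]} ({code}): {', '.join(terms)}")
```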
View Pathways
• Graphical interface for users to visualize the substrates, final products and steps in a completed pathway catalyzed by an enzyme (gene).
  – KEGG: http://www.genome.jp/kegg/tool/search_pathway.html
  – MetaCyc: http://metacyc.org
  – Pathway Tools: http://bioinformatics.ai.sri.com/ptools
Pathway Tools
3. Protein Families
Why Compute Protein Families?
• To group proteins by probable function
• To identify possible gene structure problems
• To identify evolutionary relationships
between protein families.
• Gene naming and Transposable Element
assignment
Domain Based Protein Families
(Paralogous families)
1. Start from protein sequences
2. Identify Pfam and all-vs-all BLASTP based domains
3. Families grouped based on type and number of domains
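The grouping step can be sketched as follows, assuming the domain hits per protein are already available (for example from a Pfam search). Proteins with the same domain types in the same numbers land in the same family. The protein names and the function names are hypothetical, and the all-vs-all BLASTP based domains mentioned above are not modeled here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Signature = Tuple[Tuple[str, int], ...]


def domain_signature(domains: List[str]) -> Signature:
    """Summarize a protein as sorted (domain accession, count) pairs."""
    return tuple(sorted(Counter(domains).items()))


def group_by_domains(protein_domains: Dict[str, List[str]]) -> Dict[Signature, List[str]]:
    """Group proteins whose domain type and number are identical."""
    families: Dict[Signature, List[str]] = defaultdict(list)
    for protein, domains in protein_domains.items():
        if domains:  # proteins without any domain hit are left ungrouped
            families[domain_signature(domains)].append(protein)
    return dict(families)


# Hypothetical proteins with Pfam accessions as domain labels.
hits = {
    "protA": ["PF00520", "PF00027"],
    "protB": ["PF00027", "PF00520"],
    "protC": ["PF00027", "PF00027"],
    "protD": [],
}

for signature, members in group_by_domains(hits).items():
    print(signature, members)
```

Domain order is ignored in this sketch; if the order of domains along the protein matters, the signature could keep the original list instead of sorted counts.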
Domain Based Protein Families
(Paralogous families)
Example family para_246: 9 family members contain:
• PF00027 - Cyclic nucleotide-binding domain
• PF00520 - Ion transport protein
OrthoMCL/TribeMCL Protein Clustering
• Markov clustering method for grouping
proteins into families
• http://doc.bioperl.org/bioperl-run/lib/Bio/Tools/Run/TribeMCL.html
Nucleic Acids Res. 2002 April 1; 30(7): 1575–1584.
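To give a feel for Markov clustering, here is a minimal pure-Python sketch of the generic procedure: alternate expansion (squaring the matrix) and inflation (raising entries to a power and renormalizing columns) until the matrix stops changing. It is not the OrthoMCL or TribeMCL implementation. The similarity matrix, the `inflation` default and the read-out threshold are assumptions for the example; in practice the weights would typically be derived from all-vs-all BLASTP results.

```python
from typing import List

Matrix = List[List[float]]


def normalize_columns(m: Matrix) -> Matrix:
    """Scale every column so that it sums to 1."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity: Matrix, inflation: float = 2.0,
        iterations: int = 50, tol: float = 1e-9) -> List[List[int]]:
    """Cluster node indices by alternating expansion and inflation."""
    n = len(similarity)
    # Add self-loops so every column has a non-zero sum.
    m = [[max(similarity[i][j], 1.0) if i == j else similarity[i][j]
          for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: flow spreads along paths
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each non-empty row lists the nodes attracted to that row's node.
    clusters = {frozenset(j for j, v in enumerate(row) if v > 1e-6)
                for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Toy graph: proteins 0-2 are similar to each other, as are proteins 3-4.
toy = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(toy))
```

A higher `inflation` value tends to give smaller, tighter families; the read-out used here can report overlapping clusters on less clean graphs.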
4. Naming Genes
Functional Assignments
Name
Descriptive common name for the protein, with as much
specificity as the evidence supports; gene symbol.
Role
Describe what the protein is doing in the cell and why.
Associated information:
Supporting evidence: Domain and motifs
EC number if protein is an enzyme
Paralogous family membership
Naming convention
Methods to name gene products
1. Top BLAST hit to database of choice
2. Manually aggregate evidence from multiple sources
3. Automated Assignment of Human Readable Descriptions (AHRD): https://github.com/groupschoof/AHRD
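The first method, taking the top BLAST hit, can be sketched as below. The sketch assumes the hits are in the standard 12-column tabular format (what BLAST+ writes with `-outfmt 6`), where column 1 is the query, column 2 the subject, column 11 the E-value and column 12 the bit score. The sequence identifiers in the example are made up, and the function name `top_hits` is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def top_hits(lines: Iterable[str]) -> Dict[str, Tuple[str, float, float]]:
    """Return the best hit per query as (subject, evalue, bitscore)."""
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Keep the hit with the highest bit score for each query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical tabular BLAST output for two query genes.
example = [
    "gene1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t370",
    "gene1\tsubjB\t60.0\t180\t72\t2\t5\t184\t10\t189\t2e-40\t150",
    "gene2\tsubjC\t45.0\t120\t66\t3\t1\t120\t30\t149\t5e-12\t62.4",
]

for query, (subject, evalue, bitscore) in top_hits(example).items():
    print(query, subject, evalue, bitscore)
```

The name of the top hit is only as good as the annotation of that database entry; an uninformative description is carried over unchanged, which is one reason to combine evidence as in methods 2 and 3.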
Automated Human Readable Description
(AHRD)
Exercises
• Given a protein sequence, name the protein
• Use online tools to find structural and functional domains