PROJECT DESCRIPTION

OBJECTIVES AND EXPECTED SIGNIFICANCE
Protein space
Annotation of proteins
- Current gene finders
- Debates on gene content, false positives, and false negatives
- Presence of novel genes as revealed by large-scale cDNA sequencing and tiling array
analysis.
Protein domain
- structural and functional units
- evolutionary conservation and its importance for functional inference
Unanswered questions
- While some annotated proteins may be false positives, a large number of novel proteins
likely remain undetected in plant genomes.
- A substantial number of plant proteins remain domain-less
- A substantial proportion of plant protein sequence is not covered by known domains.
- Definition of domain family and classification of plant genes
Objectives: enriching the description of plant protein space
I. Classify plant proteins and identify orthologous gene sets
Markov clustering
Tree building
Singletons and orphans
Novel genes
Database – plant gene families
II. Cluster homologous regions of plant proteins into domain families
Iterative similarity search (PSI-BLAST)
Position-Specific Score Matrix (PSSM) and Hidden Markov Models (HMMs)
Tree building
Database – plant domain families
III. Identify and map functional information to orthologous groups
Define orthologous group
- phylogenetic approach
- synteny map
Functional info
- Expression data
- Protein-protein interaction of model organisms and others
IV. Annotate plant ESTs by gene family, domain family, and orthologous groups
EST anchoring
ESTs that cannot be anchored – assess coding potential, conservation, and characteristics of
non-coding RNAs
Significance and the team
I. Intellectual merit
- Continued growth of sequence databases, the number of genomes, and ESTs
- Thorough annotation of the protein space, allowing description of regions not covered by
known domains in a plant-specific manner
- Utility of gene family information, orthology assignment, and EST anchoring in
functional annotation of plant and other genomes
- Integration of resources from other model organisms – interaction mapping
- Utility in the study of evolutionary genomics and gene family dynamics.
- Complement the
II. Broader impacts
Biological research has entered a new phase where vast sequence information and
functional data have rapidly become available, fueling a large number of exciting discoveries.
This data influx has also created a great need for a new generation of biologists that have
computational skills. The proposed research program will create an interdisciplinary training
environment for students and staff. To broaden public understanding of science in
general and of our research project in particular, the PI has established a partnership with the East
Lansing Public Library and plans to develop activities for the public on topics of biological
science, evolution, and genomics (detailed in the last section of the project description). As the
debates on evolution and intelligent design continue and the public remains largely unaware
of new developments in science such as genomics, our planned activities will provide
important opportunities for the public to understand how science works, what the facts of
evolution are, what genomics is, and how our proposed research project advances scientific
understanding. To disseminate our discoveries, a website hosting the data will be created.
RESULTS FROM PRIOR NSF SUPPORT
The PI, Shin-Han Shiu, is a new investigator with no prior NSF support. However, Shiu has
collaborated extensively with three research groups on their NSF-funded projects over the past four
years. The first is with Richard Vierstra (University of Wisconsin) on E3 ubiquitin ligases (NSF
Arabidopsis 2010; REF; Gingerich, Hanada, Shiu, and Vierstra, in preparation). The second is with
Ming-Che Shih (University of Iowa) on the β-glucosidase and β-galactosidase gene family evolution
(NSF Arabidopsis 2010; Shih and Shiu, submitted). The third is on the functional studies of plant
receptor-like kinases with John Walker (Univ. of Missouri) and Frans Tax (Univ. of Arizona)
funded by NSF.
BACKGROUND
Brief intro to:
1) Protein space, gene annotation, conserved regions, structural domains
2) Novel genes likely present in the genome
3) Protein domains/motifs and their uses in functional annotation, structural biology, and
others.
4) Ways to describe protein space – focus on breaking genes into non-overlapping families
5) Current domain database
What’s missing:
1) Need to take into account the frequent domain fusion and fission events, so the focus
should be on local rather than global sequence similarity.
2) Need to take into account fast-evolving proteins by analyzing related species
3) A large number of proteins do not have domains, and a significant proportion of protein
sequence is not covered by known domains/motifs
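One way to quantify the coverage gap noted in point 3 is the fraction of residues in each protein that fall inside at least one domain hit. The sketch below is illustrative only: the function name, the 1-based inclusive coordinate convention, and the example hits are assumptions made for this example, not part of any existing pipeline.

```python
def domain_coverage(protein_length, domain_hits):
    """Return the fraction of residues covered by at least one domain hit.

    domain_hits is a list of (start, end) pairs, 1-based and inclusive
    (an assumed convention). Overlapping hits are merged so that no
    residue is counted twice.
    """
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")

    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(domain_hits):
        # Clip hits to the protein boundaries and skip empty ones.
        start, end = max(start, 1), min(end, protein_length)
        if start > end:
            continue
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged interval: flush it.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / protein_length


# Hypothetical 100-residue protein with three hits, two of which overlap.
print(domain_coverage(100, [(1, 30), (20, 50), (81, 100)]))
```

A protein with a coverage of zero would count as domain-less, while intermediate values flag proteins with long regions that no known domain describes.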
Link to the objectives.
Significance
- annotation of plant genes
- understanding gene family evolution
- generation of hypotheses of gene function based on orthology
Links to preliminary results
PRELIMINARY STUDIES
To determine the feasibility of the proposed methods, pilot studies have been conducted in X
areas.
RESEARCH PLAN
We propose to …. The workflow is shown in Figure X. The experimental plans are detailed
below.
I. Classify plant proteins and identify orthologous gene sets
Blah
I-1. Clustering of plant protein families based on global sequence similarity
Markov clustering
- Multiple granularity settings
- Break cluster into subclusters that are more alignable
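To make the Markov clustering step concrete, the following is a minimal dense-matrix sketch of the expansion and inflation cycle. It is a toy illustration under stated assumptions, not the production MCL program (which uses sparse matrices and pruning): the function names, the unit self-loops, the convergence tolerance, and the example similarity matrix are all hypothetical. The inflation value plays the role of the granularity setting, with larger values generally giving finer clusters.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: raise entries to the power r, then renormalize."""
    return normalize_columns([[x ** r for x in row] for row in m])


def markov_cluster(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster items from a symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Add self-loops (an assumed weight of 1.0) before normalizing.
    m = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = new
        if delta < tol:
            break
    # Each row with non-zero entries lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarities: proteins 0-2 form one family, 3-4 another,
# and protein 5 has no similarity to anything (a singleton).
sim = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(markov_cluster(sim))
```

Running the same input at several inflation values and comparing the resulting partitions is one simple way to explore multiple granularity settings.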
Tree building
I-2. Significance of orphans
Singletons or orphans
- Fast evolving, with similarity to some plant sequences
- Simply not annotated in other genomes
- False positives
I-3. Identification of novel coding sequences
Novel gene
- Coding potential
- Evolutionary conservation
- Tiling array data
II. Cluster homologous regions of plant proteins into domain families
II-1. Iterative similarity search based on local sequence similarity
Iterative similarity search (PSI-BLAST)
- Deal with repeat families
- Low complexity
- Signal sequences, transmembrane domain
- Fragmentation – two seed sets: one using only annotations supported by cDNA, the other
using everything.
II-2. Populating statistical models and phylogenetic trees for domain families
Position-Specific Score Matrix (PSSM)
Hidden Markov Models (HMMs)
- Evaluate redundancy against InterPro
- Evaluate redundancy within the domain set
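As a simple illustration of what a position-specific score matrix encodes, the sketch below derives log-odds scores from a small gap-free alignment. It is a simplified stand-in rather than the PSI-BLAST procedure itself, which applies sequence weighting and data-dependent pseudocounts; the uniform background frequencies, the flat pseudocount, the function names, and the example alignment are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0, background=None):
    """Build log2-odds scores per alignment column.

    aligned_seqs: equal-length strings from a gap-free alignment.
    background: residue -> frequency; uniform if omitted (an assumption).
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must have equal length")
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

    pssm = []
    for pos in range(length):
        # Count observed residues in this column, ignoring unknown symbols.
        counts = Counter(s[pos] for s in aligned_seqs if s[pos] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in background
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same length as the PSSM."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must match PSSM length")
    # Residues outside the alphabet contribute a neutral score of 0.
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, segment))


# Hypothetical three-sequence alignment of a short conserved region.
pssm = build_pssm(["ACD", "ACE", "ACD"])
print(score_segment(pssm, "ACD"), score_segment(pssm, "WWW"))
```

Segments resembling the alignment score above zero and unrelated segments score below it, which is the property that lets such a matrix recruit additional family members in later search iterations.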
Tree building
- Deal with repeat families
- Short sequences
Broad dissemination: contact UniProt?
III. Identify and map functional information to orthologous groups
Define orthologous group
- phylogenetic approach – algorithm, bootstrap support, gene or domain family
- synteny map
Functional info
- Expression data
- Protein-protein interaction of model organisms and others
IV. Annotate plant ESTs by gene family, domain family, and orthologous groups
EST anchoring
ESTs that cannot be anchored – assess coding potential, conservation, and characteristics of
non-coding RNAs
MANAGEMENT PLAN
I. Work assignment and data exchange
II. Sharing of results and management of intellectual property
How to sustain the website and analysis pipeline without assuming long-term NSF support.
INTEGRATION OF RESEARCH AND EDUCATION
I. Partnership with the East Lansing Public Library
Shin-Han Shiu has established a partnership with Sylvia Marabate (Director), Julie
Pierce (Head of Adult and Children’s Services), and Mary Hennessey (Young Adult
Coordinator) of the East Lansing Public Library (ELPL) to develop activities aiming to
enhance the general public’s understanding of plant science, evolution, and genomics using the
proposed project as an example. In our partnership, the Shiu lab will provide scientific
expertise and ELPL will devote personnel for event planning and space. ELPL has extensive
experience in hosting outreach programs for all age groups and in attracting a broad audience
in central Michigan. Since all current programs in ELPL focus on literature, theater, and fine
arts, our planned science program will be a unique opportunity to educate the public about
science. Our outreach program will have three components:
- Multimedia presentation: the PI will develop slideshows with ELPL staff to portray how
science works, what the facts of evolution are, what genomics is, and how our proposed
research project advances scientific understanding.
- Interactive discussion session: This component includes a question-answer session on the
presentation subjects and interactive, hands-on activities that will allow participants to gain
a better understanding of evolution as a scientific discipline.
- “Field trip”: The audience will tour the Shiu lab to find out how scientists work with a
focus on how we carry out computational analyses and experimental studies. After the tour,
the participants will have the opportunity to ask questions and provide feedback for
improving future programs.
According to the ELPL staff’s experience in working with the general public, we expect our
outreach effort will reach an audience of approximately 100 teens/adults per year. We plan to
host three workshops per year for two years. The planned activities will not only provide a
channel for the public to see how our research plan contributes to science but also enhance
their understanding of how the process of science works in general, fulfilling the NSF’s goal of
broad dissemination to enhance scientific and technological understanding.