PROJECT DESCRIPTION

OBJECTIVES AND EXPECTED SIGNIFICANCE

Protein space
Annotation of proteins
- Current gene finders
- Debates on gene content, false positives, and false negatives
- Presence of novel genes as revealed by large-scale cDNA sequencing and tiling array analysis
Protein domain
- Structural and functional units
- Evolutionary conservation and its importance for functional inference
Unanswered questions
- While some annotated proteins may be false positives, a large number of novel proteins likely remain undetected in plant genomes.
- A substantial number of plant proteins remain domain-less.
- A substantial proportion of plant proteins is not covered by known domains.
- Definition of domain families and classification of plant genes

Objectives: enriching the description of plant protein space
I. Classify plant proteins and identify orthologous gene sets
   Markov clustering
   Tree building
   Singletons and orphans
   Novel genes
   Database: plant gene families
II. Cluster homologous regions of plant proteins into domain families
   Iterative similarity search (PSI-BLAST)
   Position-Specific Scoring Matrix (PSSM) and Hidden Markov Models (HMMs)
   Tree building
   Database: plant domain families
III. Identify and map functional information to orthologous groups
   Define orthologous group
   - Phylogenetic approach
   - Synteny map
   Functional info
   - Expression data
   - Protein-protein interaction of model organisms and others
IV. Annotate plant ESTs by gene family, domain family, and orthologous groups
   EST anchoring
   ESTs that cannot be anchored: assess coding potential, conservation, and characteristics of non-coding RNAs

Significance and the team
I. Intellectual merit
- Continued growth of sequence databases, the number of genomes, and ESTs
- Thorough annotation of the protein space, allowing description of regions not covered by known domains in a plant-specific manner
- Utility of gene family information, orthology assignment, and EST anchoring in functional annotation of plant and other genomes
- Integration of resources from other model organisms: interaction mapping
- Utility in the study of evolutionary genomics and gene family dynamics
- Complement the
II. Broader impacts
Biological research has entered a new phase in which vast sequence information and functional data have rapidly become available, fueling a large number of exciting discoveries. This data influx has also created a great need for a new generation of biologists who have computational skills. The proposed research program will create an interdisciplinary training environment for students and staff. To broaden public understanding of science in general and of our research project in particular, the PI has established a partnership with the East Lansing Public Library and plans to develop activities for the public on topics of biological science, evolution, and genomics (detailed in the last section of the project description). As the debates on evolution and intelligent design continue and the public remains largely unaware of new developments in science such as genomics, our planned activities will provide important opportunities for the public to understand how science works, what the facts of evolution are, what genomics is, and how our proposed research project advances scientific understanding. To disseminate our discoveries, a website hosting the data will be created.

RESULTS FROM PRIOR NSF SUPPORT
The PI, Shin-Han Shiu, is a new investigator with no prior NSF support. However, Shiu has collaborated extensively with three research groups on their NSF-funded projects over the past four years.
The first is with Richard Vierstra (University of Wisconsin) on E3 ubiquitin ligases (NSF Arabidopsis 2010; REF; Gingerich, Hanada, Shiu, and Vierstra, in preparation). The second is with Ming-Che Shih (University of Iowa) on the evolution of the β-glucosidase and β-galactosidase gene families (NSF Arabidopsis 2010; Shih and Shiu, submitted). The third, funded by NSF, is with John Walker (University of Missouri) and Frans Tax (University of Arizona) on functional studies of plant receptor-like kinases.

BACKGROUND
Brief intro to:
1) Protein space, gene annotation, conserved regions, structural domains
2) Novel genes likely present in the genome
3) Protein domains/motifs and their uses in functional annotation, structural biology, and others
4) Ways to describe protein space, with a focus on breaking genes into non-overlapping families
5) Current domain databases
What’s missing:
1) Need to take into account the frequent domain fusion and fission events, so the focus should be on local rather than global sequence similarity.
2) Need to take into account fast-evolving proteins by analyzing related species.
3) A large number of proteins do not have domains. A significant proportion of proteins is not covered by known domains/motifs.
Link to the objectives.
Significance
- Annotation of plant genes
- Understanding gene family evolution
- Generation of hypotheses of gene function based on orthology
Links to preliminary results

PRELIMINARY STUDIES
To determine the feasibility of the proposed methods, pilot studies have been conducted in X areas.

RESEARCH PLAN
We propose to …. The workflow is shown in Figure X. The experimental plans are detailed below.

I. Classify plant proteins and identify orthologous gene sets
Blah
I-1. Clustering of plant protein families based on global sequence similarity
   Markov clustering
   - Multiple granularity settings
   - Break clusters into subclusters that are more alignable
   Tree building
I-2. Significance of orphans
   Singletons or orphans
   - Fast evolving, with similarity to some plant sequences
   - Simply not annotated in other genomes
   - False positives
I-3. Identification of novel coding sequences
   Novel genes
   - Coding potential
   - Evolutionary conservation
   - Tiling array data

II. Cluster homologous regions of plant proteins into domain families
II-1. Iterative similarity search based on local sequence similarity
   Iterative similarity search (PSI-BLAST)
   - Deal with repeat families
   - Low complexity
   - Signal sequences, transmembrane domains
   - Fragmentation: two sets, one using only cDNA-supported annotations as seeds, the other using everything
II-2. Populating statistical models and phylogenetic trees for domain families
   Position-Specific Scoring Matrix (PSSM)
   Hidden Markov Models (HMMs)
   - Evaluate redundancy against InterPro
   - Evaluate redundancy within the domain set
   Tree building
   - Deal with repeat families
   - Short sequences
   Broad dissemination: contact UniProt?

III. Identify and map functional information to orthologous groups
   Define orthologous group
   - Phylogenetic approach: algorithm, bootstrap support, gene or domain family
   - Synteny map
   Functional info
   - Expression data
   - Protein-protein interaction of model organisms and others

IV. Annotate plant ESTs by gene family, domain family, and orthologous groups
   EST anchoring
   ESTs that cannot be anchored: assess coding potential, conservation, and characteristics of non-coding RNAs

MANAGEMENT PLAN
I. Work assignment and data exchange
II. Sharing of results and management of intellectual property
How to sustain the website and analysis pipeline without assuming long-term NSF support.

INTEGRATION OF RESEARCH AND EDUCATION
I. Partnership with the East Lansing Public Library
Shin-Han Shiu has established a partnership with Sylvia Marabate (Director), Julie Pierce (Head of Adult and Children’s Services), and Mary Hennessey (Young Adult Coordinator) of the East Lansing Public Library (ELPL) to develop activities that enhance the general public’s understanding of plant science, evolution, and genomics, using the proposed project as an example. In our partnership, the Shiu lab will provide scientific expertise, and ELPL will provide space and personnel for event planning. ELPL has extensive experience in hosting outreach programs for all age groups and in attracting a broad audience in central Michigan. Because all current ELPL programs focus on literature, theater, and fine arts, our planned science program will offer a unique opportunity to educate the public about science. Our outreach program will have three components:
- Multimedia presentation: The PI will develop slideshows with ELPL staff to portray how science works, what the facts of evolution are, what genomics is, and how our proposed research project advances scientific understanding.
- Interactive discussion session: This component includes a question-and-answer session on the presentation subjects and interactive, hands-on activities that will allow participants to gain a better understanding of evolution as a scientific discipline.
- “Field trip”: The audience will tour the Shiu lab to find out how scientists work, with a focus on how we carry out computational analyses and experimental studies. After the tour, the participants will have the opportunity to ask questions and provide feedback for improving future programs.
Based on the ELPL staff’s experience in working with the general public, we expect our outreach effort to reach an audience of approximately 100 teens/adults per year. We plan to host three workshops per year for two years.
The planned activities will not only provide a channel for the public to see how our research plan contributes to science but also enhance their understanding of how the process of science works in general, fulfilling the NSF’s goal of broad dissemination to enhance scientific and technological understanding.