PROJECT DESCRIPTION

I. RESULTS FROM PRIOR NSF SUPPORT

The PI, Shin-Han Shiu, is a new investigator with no prior NSF support. However, the PI has had extensive collaborations with at least four research groups on their NSF-funded projects in the past four years:

Richard Vierstra (NSF 2010 project; University of Wisconsin-Madison): on the evolution and functions of E3 ubiquitin ligases, including F-box proteins and MATH-BTB proteins.
Ming-Che Shih (NSF 2010 project; University of Iowa): on the evolution of β-glucosidase and β-galactosidase families.
John Walker (NSF; University of Missouri): on the functional studies of plant receptor-like kinases.
Sara E. Patterson (NSF Plant Genome Research Program; University of Wisconsin-Madison): on the evolution of polygalacturonases and the expression divergence among A. thaliana members, and on the evolution of ethylene binding proteins.

Publications and manuscripts generated through these collaborations (*: equal contribution, personnel from PI lab bold-faced):

Ahn, Y. O., Zheng, M., Winkel, B., Bevan, D. R., Esen, A., Shiu, S.-H., J., B., Peng, H.-P., Miller, J., Cheng, C.-L., et al. Functional genomic analysis of Arabidopsis thaliana glycoside hydrolase family 35. Submitted.
Gagne, J. M., Downes, B. P., Shiu, S. H., Durski, A. M., and Vierstra, R. D. (2002). The F-box subunit of the SCF E3 complex is encoded by a diverse superfamily of genes in Arabidopsis. Proc Natl Acad Sci U S A 99, 11519-11524.
Gingerich, D. J., Hanada, K., Shiu, S.-H., and Vierstra, R. D. Identification and analysis of the Rice BTB superfamily; evidence for a large-scale, lineage-specific expansion of an E3 ubiquitin-ligase target recognition subunit gene family in monocots. In preparation.
Kim, J.*, Shiu, S.-H.*, Thoma, S., Li, W.-H., and Patterson, S. E. (2006). Patterns of expansion and expression divergence in the plant polygalacturonase gene family. Genome Biol, In press.
Wang, W., Esch, J. J., Shiu, S.-H., Agula, H., Binder, B.
M., Chang, C., Patterson, S. E., and Bleecker, A. B. (2006). Identification of Important Regions for Ethylene Binding and Signaling in the Transmembrane Domain of the ETR1 Ethylene Receptor of Arabidopsis. Plant Cell, In press.

II. RELEVANCE AND JUSTIFICATION

1. Overview & Objectives

The long-term goal of our research program is to understand conservation and divergence of protein functions through the computational and evolutionary analyses of protein domain families in eukaryotes. Currently, conserved regions representing domains in plant proteins are not well studied. As a result, known domains in various databases can only provide a fragmentary description of plant protein space. Since protein domains are the unit of protein evolution, a thorough description of plant domains will not only improve annotations of plant proteins but also allow functional inference based on evolutionary relationships between domain sequences. Although there is a large amount of plant sequence information available, including Expressed Sequence Tags (ESTs), relatively little is known about the functions of plant sequences. Therefore, the overall goal of this project is to improve annotations of plant protein sequences and ESTs by: (1) identifying conserved protein domains in these sequences and (2) transferring functional information from genes of model species such as Arabidopsis thaliana. Specifically, we plan to identify plant protein domains, classify plant proteins and ESTs into domain families, and use phylogenomics, that is, combining knowledge of gene function and the evolutionary relationships between members of a domain family, to infer functions of plant proteins and ESTs. We have the following four aims:

1. Cluster homologous regions of plant proteins into domain families: Plant domain space is currently not properly represented by known domains.
We plan to systematically identify plant domain families from annotated plant proteins and build statistical models of these plant domains for annotating plant proteins and EST sequences.

2. Reconstruct domain family trees and attach functional information: To apply the phylogenomic framework of functional inference, we will first uncover the relationships between proteins by constructing domain sequence-based phylogenies. Three types of functional information will be attached to the trees: ontologies, gene expression, and potential protein-protein interactors inferred from knowledge of yeast and animal model systems.

3. Annotate plant ESTs with domain family definitions: To transfer functional information to plant ESTs based on phylogenetic relationships, we will classify ESTs into domain families and anchor these ESTs to functionally annotated trees (Aim 2). In addition, among ESTs that cannot be classified, we will identify those that are likely coding sequences in order to generate “EST domain families” and evaluate those that are potentially derived from RNA genes.

4. Construct the Domain database of Plant Proteins (DOPP): The above objectives will provide a collection of novel protein domains that facilitate annotation of plant proteins and genomes, functional annotations from model plants or other eukaryotes, and relationships between ESTs and coding genes. A database storing the data generated will be essential to broad dissemination in the plant research community.

The proposed activities are appropriate for the Plant Genome Research Program in building resources that will be broadly used by the entire plant research community and in transferring knowledge from model systems, including plants, fungi, and animals, to crops and/or economically important plant species. We will generate plant domain models and create a comprehensive resource, DOPP, for the research community to obtain and query plant domains.
In addition, DOPP will also provide annotations for plant proteins and ESTs with no known function by transferring ontologies, expression data, and information on protein-protein interaction from model species. Therefore, this proposal can be considered in the context of Tools and Resources for Plant Genome Research (TRPGR).

2. Background

As the sequences of multiple organisms become available, substantial efforts have been devoted to associating structural domains and functional information with the sequences. The use of protein domain information has been vital for identifying distant relationships between protein sequences, for classification of the protein sequence space, and for functional inferences. Currently, there are many domain databases available that allow automatic analysis of protein sequences by identifying domain sequence signatures, such as CDD (Marchler-Bauer et al., 2005), Pfam (Bateman et al., 2002), ProDom (Bru et al., 2005), SMART (Letunic et al., 2006), SUPERFAMILY (Madera et al., 2004), and TIGRFAMs (Haft et al., 2003). These domain databases contain similar sets of protein domains but are derived with somewhat different goals and target sequences. The SMART database is mainly focused on signaling proteins. SUPERFAMILY has a comprehensive set of domain signatures but only for proteins with known structures (Madera et al., 2004). ProDom attempts to identify all conserved regions of sequences but with a strong bias towards prokaryote, fungal, and animal sequences (Bru et al., 2005). TIGRFAMs mainly focuses on novel domains from prokaryotes (Bateman and Haft, 2002). Pfam is by far the most comprehensive domain database, with > 8000 entries that are constructed with protein sequences from a wide range of organisms (Bateman et al., 2002). Although these domain databases serve as important resources, there are several limitations on using these databases for the annotation of plant proteins.
First, some databases are constructed with an attempt to capture remote sequence similarities between proteins from phylogenetically distant organisms. Therefore, they tend to generate models for domains that are more conserved and miss organism-specific domains. For example, by focusing on bacterial sequences, many TIGRFAMs models do not overlap with the general database Pfam (Bateman and Haft, 2002), indicating that the identification of domains using relatively more closely related species will increase the protein space coverage by domain models. The second limitation is a bias in sequence representation. There is preferential detection of domains in sequences that are more similar to sequences used for building domain models such as profile Hidden Markov Models (HMMs). Because fewer plant genomes have been sequenced relative to other eukaryotes and prokaryotes, existing databases mostly use datasets containing few plant protein sequences. For example, the Pfam HMM of the Armadillo (ARM) repeat contains 141 animal, 39 fungal, and 67 plant sequences. All of these sequences are from genes with known functions. In our analysis of the plant ARM family, we found that the ARM model can only describe ~50% of plant ARM repeats in 108 genes. Therefore, we ended up generating additional HMMs for plant ARMs (Mudgil et al., 2004). The third limitation is that currently known domains do not cover the plant protein space adequately. We found that 25-60% of predicted proteins do not have any domain description and that, of those with at least one domain, ~50% have domains covering ≤ 50% of the sequence length (see Preliminary Studies). These limitations of current databases indicate that, to better understand the functions of plant proteins, a research program focusing on identifying plant protein domains is of utmost importance.
In addition to annotating protein sequences by protein domain, another major goal of gene annotation is to transfer functional information from model species to sequences with no known function. Phylogenomics, the use of evolutionary relationships to improve functional prediction of uncharacterized genes (Eisen, 1998; Eisen and Fraser, 2003; Sjolander, 2004), has been applied to functional annotation of gene families (Eisen and Hanawalt, 1999; Shiu et al., 2004) and whole genomes (Eisen et al., 2002). One of the key challenges in phylogenomic studies is domain shuffling (Sjolander, 2004). It is well established that protein domains are the units of protein evolution (Doolittle, 1995). Protein family members diverged from common ancestors not only by changing individual amino acids but also by fusion and fission of protein domains. As a result, it has been advocated that the most reasonable way to cluster and classify proteins is by examining protein domains (Liu and Rost, 2003). With a thorough description of plant domains, evolutionary relationships between plant protein sequences can be established for functional inference.

Multiple plant genome sequencing projects are either completed or near completion. In addition, over 9 million Expressed Sequence Tags (ESTs) are available from more than 100 plant species, including not only model plants but also species with important agricultural, ecological, and/or evolutionary applications. With this large amount of plant sequence information with no annotated functions, there is an ever increasing need to consolidate and classify sequences to provide a framework for hypothesizing gene functions based on evolutionary relationships and growing functional information from model species.

The long-term goal in the PI’s laboratory is to understand functional divergence and conservation among duplicate genes created via different duplication mechanisms.
We have been using evolutionary genomics approaches to determine the history of duplication and domain shuffling in plant, animal, and fungal genes (Shiu and Bleecker, 2001; Shiu and Bleecker, 2001; Gagne et al., 2002; Shiu and Bleecker, 2003; Shiu et al., 2004; Mudgil et al., 2004; Shiu et al., 2005; Shiu et al., 2006; Samuel et al., 2006; Kim et al., 2006). As a result, we have extensive experience in identifying conserved sequences as protein domains and in analyzing domain families to uncover salient features pertaining to the evolutionary history and functional changes of genes and protein domains. Our expertise is highly relevant to the proposed studies and will ensure the successful completion of the project.

3. Justification

3-A. A large number of annotated proteins do not have any Pfam domains

The first question we asked is how many annotated plant protein sequences are not covered by Pfam domains. We queried the annotated protein sequences from six eukaryote species, including four land plants and one green alga, against the > 8000 entries in the Pfam domain HMM collection (Table 1). Among plants, the coverage is the best in Arabidopsis thaliana, with 74.1% of proteins covered by at least one Pfam domain, even better than the coverage of human (69.4%). Since the Arabidopsis thaliana genome has been regarded as finished since 2000 (Arabidopsis_Genome_Initiative, 2000), its protein sequences have been incorporated into several domain/motif databases to serve as the representative from plants. It should be noted that polyploidization events occur at a much higher rate in plants compared to most other eukaryotes (Blanc et al., 2000; Vision et al., 2000; Bowers et al., 2003; Blanc and Wolfe, 2004; Cui et al., 2006). Thus, plant domains should theoretically be easier to identify since the high rate of polyploidization contributes to a higher proportion of plant genes that are derived from recent duplications (Lockton and Gaut, 2005).
Nonetheless, 25.9% of annotated protein sequences from A. thaliana do not have a single domain. Worse, ~50% of protein sequences from other plant species do not have any Pfam domain.

Table 1. Domain coverage of annotated eukaryote proteins

Species                      Total (T)   Exclude (E)1   Hit2             # of Domains3
Oryza sativa                 55800       11684          18781 (42.6%)    2612
Arabidopsis thaliana         28544       5              21139 (74.1%)    2646
Populus trichocarpa          52681       1503           25769 (50.4%)    2920
Physcomitrella patens        39793       4              16400 (41.2%)    2933
Chlamydomonas reinhardtii    11328       18             4649 (41.1%)     1685
Homo sapiens                 23692       39             16413 (69.4%)    3179

1. Domains related to transposons or retrotransposons are excluded.
2. Hit is the number of protein sequences, excluding transposable elements, that have at least one Pfam domain.
3. The number of unique domains identified in each organism.

3-B. Many plant annotated protein sequences without Pfam domains have clear homologs and show signatures of purifying selection

The low Pfam domain coverage in non-Arabidopsis plant species has at least two causes. The first is that many of these plant genes are simply false positive predictions or selfish elements. For example, in rice, the number of genes remains controversial and many annotated rice genes seem to be derived from transposable elements (Cruveiller et al., 2004; Jabbari et al., 2004; Jabbari et al., 2004; Bennetzen et al., 2004). Another explanation for the low coverage is that many plant conserved regions represent novel protein domains. While it is difficult to completely exclude false positives and transposon-derived sequences, it is expected that true coding sequences will have extensive sequence similarity and display signatures of purifying selection. To see if some of the protein sequences without domains have related sequences, we conducted a similarity search of protein sequences from A.
thaliana, rice, and poplar with BLAST (Altschul et al., 1997) and identified the top match to each sequence using a set of very conservative criteria (Figure 1 legend). We found 26896 of 62610 sequences that do not have Pfam domains satisfy these criteria. Remarkably, over 79.5% of these 26896 sequences have a homolog with ≥ 50% sequence identity (Figure 1A), indicating that they contain highly conserved regions that are likely undescribed protein domains. Figure 1. Comparisons of sequence conservation and functional constraint between annotated protein sequences with and without Pfam domains from three plant species. (A) Cumulative total of the numbers of sequence pairs with increasing sequence identities. The pairs are identified from an all-against-all BLAST search with annotated protein sequences from A. thaliana, rice, and poplar as both queries and subjects. The qualified pairs were defined as those with E value ≤ 1e-5, sequence identity ≥ 30%, aligned region ≥ 70% of the shorter sequence, and aligned region ≥ 150 amino acids. Closed circle: sequences with at least one Pfam domain. Open circle: sequences without a Pfam domain. (B) Frequency distribution of Ka/Ks values of sequence pairs with and without Pfam domains (black and white bars, respectively). The same sequence pairs as in (A) were used. Ka and Ks values were determined using PAML (Yang, 1997). Because some of the conserved sequences may represent transposons instead of genes (Bennetzen et al., 2004), we further evaluated if the annotated sequences without domains bear the hallmark of protein sequences that experience purifying selection (or functional constraint). The strength of functional constraint was measured by the ratio between non-synonymous and synonymous substitution rates (Ka and Ks, respectively, Li, 1997): the stronger the functional constraint, the lower the Ka/Ks value. In addition, sequences that are not constrained will have a Ka/Ks value close to 1. 
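The qualifying criteria for sequence pairs given in the Figure 1 legend can be written as a simple filter. The sketch below is illustrative only: the `Hit` record and its field names are hypothetical and are not taken from the analysis itself, and it assumes the search results have already been parsed into such records.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise similarity-search result (hypothetical record layout)."""
    query_len: int    # query length in amino acids
    subject_len: int  # subject length in amino acids
    align_len: int    # aligned region length in amino acids
    identity: float   # percent identity over the aligned region
    evalue: float     # E value reported by the search


def is_qualified_pair(hit: Hit) -> bool:
    """Apply the four thresholds listed in the Figure 1 legend."""
    shorter = min(hit.query_len, hit.subject_len)
    return (
        hit.evalue <= 1e-5                   # E value <= 1e-5
        and hit.identity >= 30.0             # sequence identity >= 30%
        and hit.align_len >= 0.7 * shorter   # >= 70% of the shorter sequence
        and hit.align_len >= 150             # aligned region >= 150 aa
    )
```

Because all four conditions must hold at once, a short but otherwise well-conserved motif is rejected, which is why these criteria tend to miss true relatives rather than admit false ones.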
Using the 26896 sequences identified earlier, Ka/Ks values were determined and the resulting distributions are shown in Figure 1B. Although the Ka/Ks value distribution of annotated sequences without domains is skewed toward slightly higher Ka/Ks values than that of annotated sequences with Pfam domains, 95% of sequences have Ka/Ks values significantly smaller than 1 based on a likelihood ratio test (Nekrutenko et al., 2002). This finding indicates that many of the sequences without Pfam domains are very likely functional. It should be noted that we adopted very conservative criteria for identifying homologs that err heavily on the side of not identifying true relatives. The requirement for over 70% length coverage and alignment lengths of more than 150 aa excludes many sequence pairs with local sequence similarity or that share shorter motifs. Nonetheless, with these conservative criteria we still identify a large number of annotated sequences containing conserved regions with functional constraints that should be described as protein domains.

3-C. Significant proportions of protein sequences are not covered by Pfam domains

For proteins that have at least one Pfam domain, we asked what proportion of the sequence length is covered by the domain(s). Among the plant proteins analyzed, each percent coverage bin has a similar percent total number of sequences (Figure 2A). Importantly, in all plants, ~50% of the protein sequences have < 60% of their length covered by Pfam domains. Similar to sequences without protein domains (Figure 1A), many domain-less regions in plant proteins covered by at least one Pfam domain have readily identifiable homologous sequences. However, the functional constraints on many of the domain-less regions are not as high as those on regions overlapping with known domains.
As an example, we show here the Ka/Ks values of DNA binding domains of 23 plant transcription factor families and the domain-less regions from the same genes (Figure 2B). In all cases, the DNA binding domains have much stronger functional constraints compared to the domain-less regions of the same genes. This is consistent with the skew toward higher Ka/Ks values in proteins without domains compared to those with domains (Figure 1B). These findings indicate that there is a consistent bias for identifying regions with strong selective constraints as domains. As a result, regions that are relatively fast evolving, although with readily identifiable homologs and significant functional constraints, tend to be missed during the domain annotation process. Therefore, novel plant protein domains should be identified based on the sequence conservation among plant sequences instead of conservation over extremely long phylogenetic distances between for example eukaryotes and prokaryotes or plants and animals. Figure 2. Percent length coverage of eukaryote proteins by Pfam domains and examples of accelerated evolution. (A) Frequency distribution of the numbers of plant sequences with different percent length coverage. The caption contains abbreviations of species from Table 1. (B) Examples of regions in protein sequences with widely different functional constraints. Left panel: comparison of Ka/Ks values in plant transcription factor families between DNA binding domains and regions outside of the binding domains (closed and open circles, respectively). Circles indicate the mean values and the bars indicate 95% confidence intervals. The domain family names follow those in Pfam. Right panel: number of sequence pairs in each transcription factor family. III. 
RESEARCH PLAN

We propose to (1) cluster homologous regions of plant proteins into domain families, (2) define domain family trees and attach functional information, (3) annotate plant ESTs with domain family definitions and functional information, and (4) construct the Domain database of Plant Proteins (DOPP). The overall workflow is shown in Figure 3 (next page). The experimental plans are detailed below.

Figure 3. Research plan workflow. We plan to (1-A) identify conserved regions (A-F) in annotated plant protein sequences and (1-B) build PSSMs and HMMs. The alignments for HMMs will be used to generate domain-based phylogenies (2-A). Functional information will be associated with sequences in the trees and ancestral functions will be inferred for each node (white box). With the domain models and phylogenies, we can then classify ESTs based on their domain contents and anchor ESTs to phylogenetic trees for functional inference (3-A). ESTs that cannot be classified will be assessed to determine whether they belong to protein coding genes or are potential RNA genes (3-B).

Aim 1. Cluster homologous regions of plant proteins into domain families

The current domain descriptions for plant proteins are incomplete and of questionable sensitivity. Therefore, our first objective is to exhaustively identify homologous regions of plant proteins resembling protein domains and generate statistical models for query and annotation purposes.

1-A. Iterative search based on local sequence similarity

We have shown that regions in plant proteins that have no domain annotations frequently have readily identifiable homologs but have reduced functional constraints (Figures 1 and 2). To identify conserved regions in these potentially fast evolving sequences, it is necessary to examine relatively closely related organisms instead of those with long phylogenetic distances.
Since one of our major goals is to enrich the domain descriptions of plant proteins, we will start with annotated protein sequences from finished or draft quality plant genomes listed in Table 1 (referred to as the plant protein set). Because we will establish a pipeline to automate the process of new domain identification, additional plant genomes can be incorporated as they become available. Our overall procedure for novel domain identification is similar to the automated clustering pipeline utilized by the ProDom database (Bru et al., 2005), with the following steps:

1. Mask low complexity regions using the program SEG (Wootton and Federhen, 1996).
2. Identify the shortest sequence in the plant protein set and break it down further if there are internal repeats. Treat the first repeat as the shortest sequence.
3. Use the shortest sequence to conduct a PSI-BLAST (Altschul et al., 1997) search against the plant protein set. PSI-BLAST is an iterative search program for identifying related sequences by generating a Position-Specific Score Matrix (PSSM) in each iteration to improve search sensitivity. PSI-BLAST will be run until convergence or stopped after 10 iterations.
4. Treat all regions identified in the iterative search as a new “domain family” and update the database by removing these regions.
5. Repeat steps 2-4 until no sequence is left.

Despite the similarity with ProDom procedures, we plan to implement additional quality control measures to ensure the domains identified are as complete and accurate as possible. The reason for starting with the shortest sequence is the assumption that the shortest sequence corresponds to a domain. Since we cannot be sure all annotations are correct and pseudogenes may be present in the plant protein set, the shortest sequence (Sq) may be only part of a complete domain. Therefore, we will include an additional step between steps 3 and 4 to check if the PSI-BLAST query sequence has missing coding sequences.
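Steps 2-5 of the clustering procedure can be sketched as a loop that repeatedly takes the shortest remaining sequence, collects matching regions, and removes them from the working set. This is a minimal illustration and not the proposed pipeline: the `search` callback stands in for the PSI-BLAST run, the toy `exact_match_search` is a hypothetical placeholder that only finds exact substrings, and SEG masking, internal-repeat handling, and the quality control steps are omitted.

```python
from typing import Callable, Dict, List, Tuple

Region = Tuple[str, int, int]  # (sequence id, start, end), end exclusive
SearchFn = Callable[[str, Dict[str, str]], List[Region]]


def exact_match_search(query: str, db: Dict[str, str]) -> List[Region]:
    """Toy stand-in for an iterative similarity search (exact substrings only)."""
    return [(sid, seq.find(query), seq.find(query) + len(query))
            for sid, seq in db.items() if query in seq]


def cluster_domains(seqs: Dict[str, str], search: SearchFn) -> List[List[Region]]:
    """Greedy clustering: shortest sequence first, remove matched regions, repeat."""
    db = {sid: seq for sid, seq in seqs.items() if seq}  # ignore empty entries
    families: List[List[Region]] = []
    while db:
        # Step 2: pick the shortest remaining sequence (ties broken by id).
        query_id = min(db, key=lambda sid: (len(db[sid]), sid))
        query = db[query_id]
        # Step 3: search; keep at most one region per sequence.
        hits: Dict[str, Region] = {}
        for sid, start, end in search(query, db):
            if sid != query_id and sid not in hits:
                hits[sid] = (sid, start, end)
        hits[query_id] = (query_id, 0, len(query))  # the query is always a member
        # Step 4: record the family and remove its regions from the working set.
        families.append(list(hits.values()))
        for sid, start, end in hits.values():
            seq = db.pop(sid)
            for tag, piece in (("L", seq[:start]), ("R", seq[end:])):
                if piece:  # unmatched flanks stay available for later rounds
                    db[f"{sid}|{tag}"] = piece
    return families  # Step 5: the loop ends when no sequence is left


families = cluster_domains({"a": "MKT", "b": "AAMKTCC", "c": "GGG"},
                           exact_match_search)
```

Each round removes the whole query from the working set, so the loop always terminates, and the unmatched flanks of longer proteins become candidates for later families.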
To accomplish this, we will use full length protein sequences of the PSI-BLAST hits as queries to conduct translated similarity searches against the genomic region containing Sq. If missing coding sequences are found (identity of the translated genomic region neighboring Sq ≥ identity between Sq and its top match minus 5%), Sq will be treated as a mis-annotated sequence and removed from the database. The pipeline will then restart at step 2. Another potential problem associated with the PSI-BLAST search is “profile wandering”, where false positive sequences are included and true positives are excluded (George and Heringa, 2002). To reduce profile wandering, we will use a rather stringent E-value threshold (1e-5) for PSI-BLAST searches. However, because of this stringent threshold setting, it is expected that sequences that are supposed to be in one domain family will be broken into multiple domain families. To identify relationships between domain families, we will cluster the domain families together into superfamilies as detailed in Aim 1-B. Finally, some predicted genes from complex plant genomes are likely false positives and represent full-length transposable elements or their remnants (Bennetzen et al., 2004). Therefore, it is anticipated that some of the domain families identified may be derived from transposons or retrotransposons.

1-B. Populating statistical models for domain families

The goals in this section are to generate statistical models for identified domains, to determine the overlap between novel plant domains and those that are already known, and to group related domains together into domain “superfamilies”.

(1) Domain models: Note that in the domain search pipeline we will not exclude known domains. The inclusion of regions that belong to known domains will allow us to evaluate the efficiency of the pipeline and to determine the degree of domain fragmentation.
Another reason to include known domains is to generate domain models that are more plant-specific. Nearly all domain databases focus on conservation over long evolutionary distances or have significantly more non-plant sequences when training domain statistical models. Thus, some of the models are not as sensitive for detecting plant sequences. At the end of the PSI-BLAST searches, a PSSM (Henikoff and Henikoff, 1997) describing the probability distributions of amino acids in an alignment will be generated for each domain family. PSSMs have been used for identifying conserved domains/motifs in resources such as the NCBI Conserved Domain Database (Marchler-Bauer et al., 2005). In addition to the PSSM, another commonly used statistical model for describing domain sequences is the HMM (Eddy, 1998). Since PSSMs do not necessarily outperform HMMs (Delorenzi and Speed, 2002), we will also generate an HMM for each domain by (1) generating a multiple sequence alignment (MSA) of domain family members (details in Aim 2-A) and (2) building a model of the MSA with HMMer (http://hmmer.janelia.org/). For both the PSSM and HMM of a domain family, the cutoff scores will be defined by comparing the score distributions of the true positives (members of the domain family) and random amino acid sequences with the same composition and length distribution as members of the domain family in question. The cutoff score will be the lower of two values: the score of the lowest scoring true positive and the score at a 5% false positive rate. These domain models will be referred to as the “plant domain” set.

(2) Determine overlap with known domains and anneal fragmented Plant Domains: The plant domain set will be used to search against the plant protein set. Meanwhile, we will also use InterProScan to annotate the plant protein set with InterPro domains (Zdobnov and Apweiler, 2001). InterPro compiles domain information from various databases and is the most comprehensive protein domain description available.
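The cutoff rule for a domain model, the lower of the lowest true-positive score and the score at a 5% false positive rate, can be illustrated with a short sketch. The function name and the way the 5% point is read off the random-sequence scores are assumptions made for illustration; here the 5% point is taken as the lowest score that still admits no more than 5% of the random sequences when scores at or above the cutoff count as hits.

```python
from typing import Sequence


def cutoff_score(true_scores: Sequence[float],
                 random_scores: Sequence[float],
                 fpr: float = 0.05) -> float:
    """Return min(lowest true-positive score, score at the given false positive rate)."""
    if not true_scores or not random_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(random_scores, reverse=True)
    # Number of random sequences allowed to score at or above the cutoff.
    # Assumption: at least one is allowed, so with fewer than 1/fpr random
    # scores the realized rate will exceed the nominal rate.
    allowed = max(1, int(fpr * len(ranked) + 1e-9))
    fpr_score = ranked[allowed - 1]
    return min(min(true_scores), fpr_score)


# 100 random scores 1..100: the 5% point is 96 (five scores are >= 96).
cutoff_a = cutoff_score([120, 98, 110], list(range(1, 101)))
cutoff_b = cutoff_score([90, 120], list(range(1, 101)))
```

In the first call all true positives score above the 5% point, so the 5% point sets the cutoff; in the second, a weak family member pulls the cutoff below it, trading specificity for keeping every known member.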
Here we will examine regions with both plant domain and InterPro annotations to evaluate the overlap in sequence coverage and to establish relationships between InterPro and plant domains for future reference. In addition, it is known that domains identified through PSI-BLAST-based approaches tend to be fragmentary (Liu and Rost, 2003). Different plant domains that are nested within an InterPro domain will be regarded as a “domain contig”. Plant domains that do not overlap with InterPro domains will be assigned to the same domain contig if all plant protein sequences of the involved domain families have the same domain composition. New PSSMs and HMMs will be generated for these domain contigs that better represent the classical definition of protein domains.

(3) Overlap among Plant Domain models: Because a relatively stringent cutoff will be used for identifying related sequences in the iterated similarity searches, it is anticipated that the Plant Domain set will contain distinct entries that are related. To identify the relationships between plant domain entries, we will conduct an all-against-all BLAST search with all domain sequences and use transformed E-values to generate a similarity matrix for Markov Clustering (MCL; Van Dongen, 2000). Domain families in the same cluster are regarded as members of a domain “superfamily”. New PSSMs and HMMs will be generated for these domain superfamilies. After these quality control steps, the PSSMs and HMMs generated will be used for annotating both plant proteins and ESTs (Aim 3) and will be available from the web interface, DOPP (Aim 4).

Aim 2. Reconstruct domain family trees and attach functional information

Domain sequences better approximate the units of protein evolution in the context of domain shuffling.
To provide a phylogenomic framework for further analyses of the domain evolutionary histories and for inference of functions with related sequences, we plan to build trees for each domain family and anchor three types of functional information onto the trees from model species. 2-A. Domain-based trees (1) Multiple sequence alignments: For each domain and domain contig (a set of domains that are common among all domain family members), a multiple sequence alignment will be generated using three programs: MAFFT (Katoh et al., 2005), SPEM (Zhou and Zhou, 2005) and ProbCons (Do et al., 2005). All three programs implement progressive alignment and iterative refinement algorithms that are more accurate than other alignment software (Zhou and Zhou, 2005; Edgar and Batzoglou, 2006). Depending on the benchmark set and level of sequence similarity, however, one may out-perform the other two. We will use all three programs to account for the heterogeneity inherent to our dataset (in sequence lengths, numbers, and similarities). For each domain, the alignments generated by each method will be further refined with the block-multiple alignment refinement program, BMArefiner (Chakrabarti et al., 2006). The BMArefiner has implemented objective functions for determining alignment quality scores. The refined alignments with the highest quality score will be used for building HMMs (Aim 1-B) and phylogenetic reconstruction in the following section. (2) Phylogenetic inference: Two methods will be used to generate phylogenetic trees. The first is neighbor-joining (NJ; Saitou and Nei, 1987). For each domain family, an NJ tree will be generated with 1000 bootstrap replicates where multiple substitutions are corrected using the Poisson distance and alignment gaps are treated as missing characters. The second approach is maximum parsimony (MP). Branch and bound tree searches will be performed based on the alignments of each domain family. 
Parsimony ratchet algorithm will be used to iteratively reduce tree space until the shortest trees are found (Nixon, 1999). A strict consensus tree will be generated if multiple equally parsimonious trees are uncovered. The phylogenies generated will be integrated into the database (Aim 4) and will be used for functional inference of each tree node (Aim 2-B) and plant ESTs (Aim 3-A). Evolutionary history of genes is not necessarily tree-like because domain fusion/fission events and gene conversion occur at a significant rate. In this context, the relationships between genes are better captured in the context of domain families. Since most genes will have more than one domain, we will treat the phylogenies for each domain family as multiple hypotheses for the relationships between genes containing the domain in question, instead of just picking one tree. One potential problem for domain-based trees is that some domains are too short and contain insufficient informative characters for phylogenetic reconstruction. Another difficulty is presented by repeated domain families with different repeat numbers among family members. Therefore, we will also use full length protein sequences of members of a domain family to build gene trees to complement domain trees. Again, the full length sequence tree will be treated as one hypothesis for the possible relationships between members of a family. 2-B. Association of functional information with phylogenies The goal in this section is to anchor functional information onto the phylogenetic trees for functional inferences based on evolutionary relationships. In this phylogenomic framework, functions of genes can be uncovered by examining functional data of genes from related model species. In addition to attaching functional information to the trees, we will also infer the ancestral functions at each node of the trees by the parsimony method (Figure 4A). 
This approach will provide a platform for functional predictions of any sequence residing in any clade defined by its subtending node. The types of functional information and how they will be incorporated into the trees are described below. Figure 4. Inference of ancestral functions and physical interactions. (A) Inference of ancestral function. The example tree has 4 genes, g1-g4. Suppose there are 3 functions associated with these 4 genes. Each gene either performs a function (1) or not (0). The ancestral functional state at nodes a, b, and c can be inferred using the parsimony principle, where the number of functional gain and loss events is minimized. For example, we can infer that the common ancestor of g1 and g2 may or may not have had function A but had functions B and C. The arrow indicates the anchorage point of an EST unipeptide (ESTx). The putative functions of the unipeptide will be those of node c. (B) Interaction inference. Protein a1 interacts with a2-a4 and has orthologous relationships with plant proteins p1 and p2. If p3-p5 are orthologous to any of a2-a4, then p3-p5 are regarded as putative interactors of p1 and p2. (1) Gene Ontology & Plant Ontology: Three types of functional data will be incorporated into the trees. The first is Gene Ontology (GO) & Plant Ontology (PO). The GO project is a community-based effort to generate and use ontologies to facilitate the biologically meaningful annotation of genes and their products in a wide variety of organisms (Ashburner et al., 2000). For plants, both the TAIR and Gramene databases are members of the GO consortium, contributing gene annotations of A. thaliana and grain species including rice. The PO consortium (Pujar et al., 2006), on the other hand, develops ontologies describing plant anatomy and growth and developmental stages for gene annotation in angiosperms. Both GO and PO annotations will be associated with gene domains.
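The parsimony inference illustrated in Figure 4A can be sketched with the bottom-up pass of Fitch's algorithm for a single binary character (one function, present = 1 or absent = 0). This is a minimal sketch under stated assumptions: the tree topology ((g1, g2), (g3, g4)) and the function profiles below are hypothetical and are not taken from the figure, and only the bottom-up pass is shown, so the sets at internal nodes other than the root are preliminary. A state set of {0, 1} corresponds to "may or may not have the function".

```python
def fitch_states(tree, leaf_states):
    """Bottom-up Fitch pass on a binary tree given as nested 2-tuples.

    tree        : nested tuples of leaf names, e.g. (("g1", "g2"), ("g3", "g4"))
    leaf_states : dict mapping leaf name -> 0 or 1 for one function
    Returns a dict mapping each internal node (its tuple) to a set of states.
    """
    result = {}

    def visit(node):
        if isinstance(node, str):
            return {leaf_states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        # Intersection if the children agree on some state, otherwise union
        # (a union implies one gain or loss event below this node).
        states = shared if shared else left | right
        result[node] = states
        return states

    visit(tree)
    return result


# Hypothetical topology and function profiles, for illustration only.
TREE = (("g1", "g2"), ("g3", "g4"))
FUNCTION_A = {"g1": 1, "g2": 0, "g3": 0, "g4": 0}
FUNCTION_B = {"g1": 1, "g2": 1, "g3": 0, "g4": 1}

states_a = fitch_states(TREE, FUNCTION_A)
states_b = fitch_states(TREE, FUNCTION_B)
```

With these profiles, the ancestor of g1 and g2 is ambiguous for function A and is inferred to have function B. A unipeptide anchored on the branch leading to that clade would inherit the state set of the subtending node.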
For GOs, because of the high error rate in categories that are not annotated based on experiments, we will only use annotations with an experimental evidence code: inferred from direct assay (IDA), mutant phenotype (IMP), expression pattern (IEP), genetic interaction (IGI), or physical interaction (IPI). (2) Expression data: Plant microarray data continue to grow quickly, and more than 200 different experiments are available for A. thaliana in ArrayExpress (Parkinson et al., 2005). These expression data can provide clues to the temporal and spatial distribution of transcripts for genes with no known expression patterns. Instead of dealing with the equivalence between experiments from the same plant or different plants, each experiment will be treated as an independent dataset. The expression data will be associated with genes in two different ways. First, instead of simply distinguishing a gene as expressed or not expressed, we will make use of the expression level information by treating expression as continuous character states and inferring ancestral conditions using Mesquite (Maddison and Maddison, 2006). The second approach is to identify differentially expressed genes and assign up-regulated genes character state 1, down-regulated genes -1, and genes without significant changes 0. This approach will only be applied to stress-related datasets. For each stress-related dataset, control and treatment groups (with each time point scored independently) will be normalized and differential expression of genes will be detected using LIMMA (Smyth, 2004). Ancestral expression patterns will be inferred based on parsimony. (3) Interaction data: Protein-protein interaction data have increased substantially over the past several years due to the efforts of individual labs and several large-scale interactome projects for yeast, fly, C. elegans, and human (summarized in Gandhi et al., 2006). However, plant protein interaction data remain scarce.
Assuming that functional protein interactions are conserved in evolution, hypotheses regarding the potential interaction partners in plants can be extended by mining data from the model organism protein interaction datasets. Such interaction inference is conceptually identical to phylogenomic inference and has been applied to inferring human protein interactions (Matthews et al., 2001; Kemmer et al., 2005). To infer plant protein interactions, we will first identify orthologous groups between plant proteins and their fungal or animal counterparts with a tree-based approach (Shiu et al., 2006). Then the putative interacting proteins will be mapped through the inferred orthologous relationships (Figure 4B) based on the interaction data from the Database of Interacting Proteins (DIP; Salwinski et al., 2004). Note that plants diverged from fungi and animals long ago. Extreme lineage-specific expansion in plants and/or fungi/animals (for example, the RLK/Pelle family with 1 member in fly and >600 in A. thaliana; Shiu and Bleecker, 2003) makes the inference less meaningful. Therefore, the size of the orthologous group will be provided along with the predictions. After the function assignment and ancestral function inference are completed, the phylogenies with functional information will be used to address several outstanding questions concerning the fate of duplicate genes, including (1) what types of domains tend to expand independently in various lineages of land plants, (2) how fast novel functions arise after gene duplication, and (3) what is the predominant fate of duplicate genes, subfunctionalization or neofunctionalization? The functionally annotated phylogenies will also be used to infer the functions of ESTs, as outlined in Aim 3. Aim 3.
Annotate plant ESTs with domain family definitions Establishing the relationships between ESTs and protein sequences from model plant species is an important step for generating hypotheses concerning EST functions. Therefore, we plan to anchor ESTs to domain family trees for functional inferences. Because ESTs are partial transcripts of both protein coding and RNA genes, we also plan to examine the coding potential and sequence conservation of ESTs that cannot be classified to evaluate their potential to be RNA genes. 3-A. EST classification based on domain families (1) Assignment to domain families: Substantial efforts have been devoted to identifying sequence similarity between ESTs and plant proteins in NSF-funded resources such as Phytome (Hartmann et al., 2006). For mapping ESTs to domain families, we will take the unipeptides (hypothetical protein sequences of ESTs) from Phytome to search against the plant domain set constructed in Aim 1-B with HMMer. Some EST unipeptides will be readily classified into domain families. However, since ESTs and even EST contigs are mostly partial transcripts, they may lack part of an otherwise recognizable plant domain. To reduce the number of unassigned regions in ESTs or EST contigs, the unipeptides will also be used to search the plant protein set (defined in Aim 1-A) with BLAST to determine if any unipeptide maps to only part of a domain in the top match plant protein. If so, these fragmented domains in ESTs will be indicated but not processed any further. Unipeptides rely on the proper assembly of ESTs into contigs. To resolve any ambiguity in domain assignments due to misassembly or problems in identifying translated regions, we will also take individual ESTs to search the plant protein set. The coordinates of matching areas will be checked against the EST contig assembly information and domains in the matching plant protein.
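The check for unipeptides that map to only part of a domain in the top-match plant protein reduces to an interval overlap on the plant protein coordinates. The sketch below is one possible way to express it; the function names, the 1-based inclusive coordinate convention, and the 0.9 cutoff for calling a domain complete are all assumptions made for illustration and are not values specified in the proposal.

```python
def domain_coverage(domain_start, domain_end, hit_start, hit_end):
    """Fraction of a domain covered by a BLAST hit on the same protein.

    All coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if domain_end < domain_start or hit_end < hit_start:
        raise ValueError("end coordinate must not precede start coordinate")
    overlap = min(domain_end, hit_end) - max(domain_start, hit_start) + 1
    overlap = max(overlap, 0)  # negative means the intervals do not touch
    return overlap / (domain_end - domain_start + 1)


def classify_hit(domain_start, domain_end, hit_start, hit_end,
                 complete_cutoff=0.9):
    """Label a hit as 'complete', 'fragment', or 'no_overlap'.

    complete_cutoff is a hypothetical threshold, not a value from the text.
    """
    coverage = domain_coverage(domain_start, domain_end, hit_start, hit_end)
    if coverage == 0:
        return "no_overlap"
    if coverage >= complete_cutoff:
        return "complete"
    return "fragment"
```

For example, a domain at residues 101-200 matched by a unipeptide alignment spanning residues 151-300 is covered over half its length and would be indicated as a fragmented domain.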
Any discrepancy will be flagged and the EST contigs will be broken into smaller contigs to ensure correct domain family assignment. New unipeptides derived from these smaller contigs will be extracted for the domain identification procedures mentioned above. (2) Anchoring onto functionally annotated trees: Once unipeptides have been mapped to domain families, the next phase will be anchoring unipeptides to the domain trees. For a domain sequence with a unipeptide, its top match will be identified among annotated plant protein members of the domain family. Then an iterative searching algorithm will be applied to identify a monophyletic group where the maximum distance within the group is between the unipeptide and its most closely related plant protein sequence (Shiu et al., 2005). In the trees of the domain family the unipeptide belongs to, the branch that leads to the clade with all members of the monophyletic group is the anchorage point (Figure 4A, arrow), and the putative function of unipeptides will be inferred from the node subtending the anchorage point. 3-B. Assessment of properties of ESTs that cannot be assigned It is known that many ESTs do not have obvious sequence similarity to annotated protein sequences (Vandepoele and Van de Peer, 2005) and will not be assigned to any domain family. We found that some of these ESTs likely contain novel protein coding genes (Hanada et al., submitted). In addition, several important studies have revealed the importance of RNA genes in many eukaryotes including plants (Meyers et al., 2006). There are at least two reasons why it is important to distinguish ESTs with coding sequences from those derived from RNA genes. Since many plant species have ESTs but few have sequenced genomes, many plant-specific domains likely will be identified from EST datasets. In addition, a way to distinguish protein coding from RNA genes will better guide the experimental efforts for understanding gene functions using ESTs.
(1) Coding potential of non-assigned ESTs: We have developed a simple but very efficient measure of coding potential called Coding Index based on the differences in the nucleotide compositions of coding and non-coding sequences (CI; Hanada & Shiu, submitted). We will use this method to evaluate if an EST contains coding sequence and extract the putative coding sequences for further comparative analysis. For each species, we will use introns+UTRs and the coding sequences of published genes as training sets for non-coding and coding regions, respectively. The posterior probability that a sequence is coding is calculated with an implementation of the Bayes’ theorem incorporating nucleotide composition information of both coding and non-coding training sequences. Regions that are ≥ 20aa and have a CI value higher than 95% of non-coding training sequences are regarded as putative coding sequences. Because EST sequences tend to be of lower quality with higher error rates, putative coding regions will be annealed together if they are separated by 1-2 bases in the same orientation. All EST regions with significant CI values will be verified in (2). (2) Relatives of putative coding sequences: To assess if the putative coding sequences are subject to selective constraints in its non-synonymous sites, the Ka/Ks value will be determined for each putative coding sequence and its top match and evaluated if it is significantly less than one (Nekrutenko et al., 2002), indicating functional constraints at the coding sequence level. The top match is identified within the putative coding sequence set with an identity threshold of 30%, and an alignment length threshold of 20aa. Sequences with functional constraints will then be used for domain identification using the pipeline outlined in Aim 1-A. The domains identified from these EST putative coding sequences with evidence of functional constraints will be referred to as the plant EST domain set. 
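To make the Bayesian step above concrete, the following is a generic sketch of a posterior coding probability computed from nucleotide composition. It is not the Coding Index itself, whose details are in a manuscript we cannot reproduce here. The sketch assumes single-nucleotide frequencies, independence between sites, a pseudocount of 1, and equal prior probabilities for the coding and non-coding classes; all function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def base_frequencies(sequences, pseudocount=1.0):
    """Nucleotide frequencies of a training set, smoothed with a pseudocount."""
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in ALPHABET)
    total = sum(counts[base] for base in ALPHABET) + pseudocount * len(ALPHABET)
    return {base: (counts[base] + pseudocount) / total for base in ALPHABET}


def log_likelihood(seq, freqs):
    """Log probability of a sequence under independent-site frequencies.

    Characters outside A/C/G/T (for example N) are ignored.
    """
    return sum(math.log(freqs[base]) for base in seq.upper() if base in ALPHABET)


def posterior_coding(seq, coding_freqs, noncoding_freqs, prior_coding=0.5):
    """Posterior P(coding | sequence) by Bayes' theorem."""
    log_coding = math.log(prior_coding) + log_likelihood(seq, coding_freqs)
    log_noncoding = (math.log(1.0 - prior_coding)
                     + log_likelihood(seq, noncoding_freqs))
    difference = log_noncoding - log_coding
    if difference > 700:  # avoid overflow in exp(); posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(difference))


# Toy training sets, for illustration only.
coding_freqs = base_frequencies(["GCGCGCGC"])
noncoding_freqs = base_frequencies(["ATATATAT"])
```

In practice the training sets would be the coding sequences and the introns+UTRs of published genes for each species, and the score threshold would be calibrated against the non-coding training sequences as described above.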
We will not build phylogenies for these EST domains since the main purpose of the phylogenies is functional inference. (3) Putative RNA genes: ESTs that do not contain putative coding sequences are likely derived from RNA genes or untranslated regions (UTRs) of protein coding transcripts. We will only attempt to distinguish RNA genes from UTRs in A. thaliana and rice since both species have sequenced genomes and a fair number of full-length cDNAs. If the ESTs are within a threshold distance of known coding regions, they will be regarded as UTRs. The threshold distance will be the length that 95% of the 5’ or 3’ UTRs, based on full-length cDNAs, do not exceed. Since the main goal of the proposed studies is annotating plant protein space, we will explore only two aspects of these potential RNA genes. First, we are interested in determining if these putative RNA genes are also identified by methods based on experiments (Lu et al., 2005). Second, we are interested in knowing if some of these putative RNA genes form gene families and if related sequences show functional constraints that can be evaluated by comparing their substitution rates to those of introns and intergenic sequences. These preliminary studies will bring novel insights regarding the evolution of RNA genes, which is poorly understood at this point. Aim 4. Construct the Domain database of Plant Proteins (DOPP) The information generated from the first three aims will be very useful to the bioinformatics teams of genome projects annotating plant genomes and to individual investigators interested in finding domain content, evolutionary relationships, or functional inferences for their sequences of interest. Therefore, our final goal is to construct a database, DOPP, to hold the large amount of data generated by the proposed studies, together with a query interface for browsing and searching the content of the database.
Below we provide details on the data types that will be available from DOPP, the web interface, and the implementation plan. (1) Data types: The proposed project will generate plant domain sequences (Aim 1-B), statistical models for domains and EST domains (Aim 1-B & Aim 3-B), sequence alignments (Aim 2-A), phylogenies with functional annotations (Aim 2-B), EST/unipeptide mapping information (Aim 3-A), putative coding sequences in ESTs that cannot be mapped to the trees (Aim 3-B), and ESTs potentially derived from RNA genes (Aim 3-B). (2) Interface design: DOPP will be constructed in a way that is similar to the Pfam interface, with the major difference that DOPP will incorporate phylogenetic trees and EST information. Users can retrieve all data types in bulk for large-scale analyses or annotations by providing a list of names, by specifying the organisms, or by specifying the data types. In addition, users can find the entry of interest by entering genome database specific gene names, GenBank/EMBL/SWISSPROT accession numbers, or sequences. If the query sequence or keyword entered matches what is in the database, they will be able to follow the entry identified to get information on its domain content, placement on phylogenetic trees, inferred functions, and related sequences. On the other hand, if the query sequence is not in the database, it will be queried against domain models with HMMer and reverse PSI-BLAST (for protein sequences) or Wise2 (for nucleotide sequences) and the sequence will be anchored to phylogenies as described in Aim 3-A. (3) Implementation plan: The implementation of DOPP will be assisted by the Research Technology Support Facility (RTSF) at Michigan State University (see Support Letter). RTSF has dedicated personnel and expertise for web interface design and implementation for large-scale genomics projects. Several of these projects are NSF-funded activities (e.g. Galdieria genome sequencing and Plastid 2010 project).
Because DOPP will provide both pre-computed results and a query interface for onsite computation, the website will be hosted on a three-node cluster with a storage node for handling data storage and database queries and two computing nodes for onsite analysis. We will use the Apache web server for the website and MySQL for the database. Data will be backed up weekly and the cluster will be managed by a system administrator in the Plant Biology Department. We plan to maintain the database past the funding period by giving our analysis pipelines a modular design, so that components can be changed with minimal changes to our code. We also plan to contribute the DOPP models to the InterPro database, which is actively maintained and broadly used. IV. INTEGRATION OF RESEARCH AND EDUCATION 1. Research Opportunities on Computational Genomics for High School and Undergraduate Students The proposed project directly involves training of graduate students and postdoctoral researchers in the field of computational genomics. In addition to researchers that are directly involved in the project, the PI has established collaborations with two organizations on the Michigan State University campus that will provide training opportunities to high school and undergraduate students. The PI is involved in the High School Honors Science, Mathematics & Engineering Program (HSHSP, www.msu.edu/~hshsp/). HSHSP is a program that recruits high school sophomore or junior students to the MSU campus to work in a research laboratory. During the summer of 2006, an HSHSP student worked on the identification of mis-annotated Arabidopsis thaliana genes using missing protein domains as the criterion. The student was able to learn programming in two weeks and at the same time gain an understanding of the biological and computational problems involved in the project. In the ensuing six weeks, he worked on the project and gave a presentation and a report at the end.
Currently we are finalizing the findings for a manuscript. During the funding period, we plan to recruit two high school students from HSHSP per year to work on (1) the coverage of plant domains in plant proteins, (2) the degree of functional constraint on different domains, (3) transposable element related domains, and (4) the rate of domain shuffling in plant proteins. In addition, the PI is involved in the Research Training Program for Undergraduate Students in Biological and Mathematical Sciences (UBM). The mission of UBM is to provide intense training at the intersection of mathematical and biological sciences, including interdisciplinary educational opportunities and research experiences. Students in the program are mostly majoring in quantitative sciences such as Mathematics, Statistics, and Computer Science. Given the importance of interdisciplinary research in biology, it is important to attract students with strong quantitative backgrounds to work on plant biology problems. During the funding period, we plan to recruit two students at any given time from this program to work on the proposed project. Specifically, the students will focus on (1) simulation studies on amino acid sequence evolution in protein domains, (2) domain network analysis, and (3) machine learning problems in combining heterogeneous functional information for annotations. 2. Partnership with the East Lansing Public Library The PI has established a partnership with Sylvia Marabate (Director), Julie Pierce (Head of Adult and Children’s Services), and Mary Hennessey (Young Adult Coordinator) of the East Lansing Public Library (ELPL) to develop activities aiming to enhance the general public’s understanding of science, evolution, and genomics using the proposed research project as an example. The importance of this type of partnership is well supported by the conclusions of a recent NSF publication on the public attitudes toward science and technology (National_Science_Board, 2004).
Two-thirds of Americans do not clearly understand the scientific process. Nearly 150 years after the Origin of Species, evolution is still under attack, and only 53% of Americans believe that human beings developed from earlier species of animals. The public in general is ignorant about the new developments in science, such as genomics. These statistics suggest that an outreach program focused on the process of science, facts on evolution, and the prospects of genomics will be an important effort to improve science literacy. In our partnership, the PI’s lab will provide scientific expertise and ELPL will devote personnel for event planning and space. ELPL has extensive experience in hosting outreach programs for all age groups and in attracting a broad audience in central Michigan. Since all current programs in ELPL focus on literature, theater, and fine arts, our planned science program will be a unique opportunity to educate the public about science. The ELPL has several programs targeted towards different audiences. We have submitted a proposal to the Plant Comparative Genome Sequencing Program with a similar outreach component with a focus on adult audience. ELPL has a successful young adult program introducing various subject matters to high school students. Here we will focus on problem solving activities and hands-on experience in sequence comparison using MSU computer classrooms. With the ELPL staff’s experience and reputation in working with the general public and their commitment to work together with the PI, we expect our outreach effort will reach an audience of approximately 100 teens/adults per year. The planned activities will not only provide a channel for the public to see how our research plan contributes to science but also enhance their understanding of how the process of science works in general, fulfilling the NSF’s goal of broad dissemination to enhance scientific and technological understanding. 
REFERENCES Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402 Arabidopsis_Genome_Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-29 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30: 276-280 Bateman A, Haft DH (2002) HMM-based databases in InterPro. Brief Bioinform 3: 236-245 Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W (2004) Consistent over-estimation of gene number in complex plant genomes. Curr Opin Plant Biol 7: 732-736 Blanc G, Barakat A, Guyot R, Cooke R, Delseny M (2000) Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell 12: 1093-1101 Blanc G, Wolfe KH (2004) Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16: 1667-1678 Bowers JE, Chapman BA, Rong J, Paterson AH (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422: 433-438 Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D (2005) The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 33: D212-215 Chakrabarti S, Lanczycki CJ, Panchenko AR, Przytycka TM, Thiessen PA, Bryant SH (2006) Refining multiple sequence alignments with conserved core regions.
Nucleic Acids Res 34: 2598-2606 Cruveiller S, Jabbari K, Clay O, Bernardi G (2004) Incorrectly predicted genes in rice? Gene 333: 187-188 Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A, Albert VA, Ma H, dePamphilis CW (2006) Widespread genome duplications throughout the history of flowering plants. Genome Res 16: 738-749 Delorenzi M, Speed T (2002) An HMM model for coiled-coil domains and a comparison with PSSM-based predictions. Bioinformatics 18: 617-625 Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 15: 330-340 Doolittle RF (1995) The multiplicity of domains in proteins. Annu. Rev. Biochem. 64: 287-314 Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14: 755-763 Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16: 368-373 Eisen JA (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 8: 163-167 Eisen JA, Fraser CM (2003) Phylogenomics: intersection of evolution and genomics. Science 300: 1706-1707 Eisen JA, Hanawalt PC (1999) A phylogenomic study of DNA repair genes, proteins, and processes. Mutat Res 435: 171-213 Eisen JA, Nelson KE, Paulsen IT, Heidelberg JF, Wu M, Dodson RJ, Deboy R, Gwinn ML, Nelson WC, Haft DH, Hickey EK, Peterson JD, Durkin AS, Kolonay JL, Yang F, Holt I, Umayam LA, Mason T, Brenner M, Shea TP, Parksey D, Nierman WC, Feldblyum TV, Hansen CL, Craven MB, Radune D, Vamathevan J, Khouri H, White O, Gruber TM, Ketchum KA, Venter JC, Tettelin H, Bryant DA, Fraser CM (2002) The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic, anaerobic, green-sulfur bacterium. Proc Natl Acad Sci U S A 99: 9509-9514 Gagne JM, Downes BP, Shiu SH, Durski AM, Vierstra RD (2002) The F-box subunit of the SCF E3 complex is encoded by a diverse superfamily of genes in Arabidopsis. 
Proc Natl Acad Sci U S A 99: 11519-11524 Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, Mohan SS, Sharma S, Pinkert S, Nagaraju S, Periaswamy B, Mishra G, Nandakumar K, Shen B, Deshpande N, Nayak R, Sarker M, Boeke JD, Parmigiani G, Schultz J, Bader JS, Pandey A (2006) Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 38: 285-293 George RA, Heringa J (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins 48: 672-681 Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371-373 Hartmann S, Lu D, Phillips J, Vision TJ (2006) Phytome: a platform for plant comparative genomics. Nucleic Acids Res 34: D724-730 Henikoff S, Henikoff JG (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Sci 6: 698-705 Jabbari K, Cruveiller S, Clay O, Le Saux J, Bernardi G (2004) The new genes of rice: a closer look. Trends Plant Sci 9: 281-285 Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33: 511-518 Kemmer D, Huang Y, Shah SP, Lim J, Brumm J, Yuen MM, Ling J, Xu T, Wasserman WW, Ouellette BF (2005) Ulysses - an application for the projection of molecular interactions across species. Genome Biol 6: R106 Kim J, Shiu S-H, Thoma S, Li W-H, Patterson SE (2006) Patterns of expansion and expression divergence in the plant polygalacturonase gene family. Genome Biol: In press Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P (2006) SMART 5: domains in the context of genomes and networks. Nucleic Acids Res 34: D257-260 Li W-H (1997) Molecular evolution. Sinauer Associates, Sunderland Liu J, Rost B (2003) Domains, motifs and clusters in the protein universe. Curr Opin Chem Biol 7: 5-11 Lockton S, Gaut BS (2005) Plant conserved non-coding sequences and paralogue evolution.
Trends Genet 21: 60-65 Lu C, Tej SS, Luo S, Haudenschild CD, Meyers BC, Green PJ (2005) Elucidation of the small RNA component of the transcriptome. Science 309: 1567-1569 Maddison WP, Maddison DR (2006) Mesquite: a modular system for evolutionary analysis. In, Ed 1.12. http://mesquiteproject.org Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32: D235-239 Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Marchler GH, Mullokandov M, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Yamashita RA, Yin JJ, Zhang D, Bryant SH (2005) CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 33: D192-196 Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 11: 2120-2126 Meyers BC, Souret FF, Lu C, Green PJ (2006) Sweating the small stuff: microRNA discovery in plants. Curr Opin Biotechnol 17: 139-146 Mudgil Y, Shiu SH, Stone SL, Salt JN, Goring DR (2004) A large complement of the predicted Arabidopsis ARM repeat proteins are members of the U-box E3 ubiquitin ligase family. Plant Physiol 134: 59-66 National_Science_Board (2004) Chapter 7. Science and technology: public attitudes and understanding. In Science and Engineering Indicators 2004, Arlington, VA Nekrutenko A, Makova KD, Li WH (2002) The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res 12: 198-202 Nixon KC (1999) The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis.
Cladistics 15: 407-414 Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Sansone S, Brazma A (2005) ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 33: D553-555 Pujar A, Jaiswal P, Kellogg EA, Ilic K, Vincent L, Avraham S, Stevens P, Zapata F, Reiser L, Rhee SY, Sachs MM, Schaeffer M, Stein L, Ware D, McCouch S (2006) Whole Plant Growth Stage Ontology for Angiosperms and its Application in Plant Biology. Plant Physiol Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4: 406-425 Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32: D449-451 Samuel MA, Salt JN, Shiu S-H, Goring DR (2006) Multifunctional arm repeat domains in plants. Int J Cytol In press Shiu S-H, Bleecker AB (2001) Plant receptor-like kinase gene family: diversity, function, and signaling. Sci STKE 2001: RE22 Shiu S-H, Bleecker AB (2001) Receptor-like kinases from Arabidopsis form a monophyletic gene family related to animal receptor kinases. Proc Natl Acad Sci U S A 98: 10763-10768 Shiu S-H, Bleecker AB (2003) Expansion of the receptor-like kinase/Pelle gene family and receptor-like proteins in Arabidopsis. Plant Physiol 132: 530-543 Shiu S-H, Karlowski WM, Pan R, Tzeng YH, Mayer KF, Li WH (2004) Comparative analysis of the receptor-like kinase family in Arabidopsis and rice. Plant Cell 16: 1220-1234 Shiu SH, Byrnes JK, Pan R, Zhang P, Li WH (2006) Role of positive selection in the retention of duplicate genes in mammalian genomes. Proc Natl Acad Sci U S A 103: 2232-2236 Shiu SH, Shih MC, Li WH (2005) Transcription factor families have much higher expansion rates in plants than in animals.
Plant Physiol 139: 18-26 Sjolander K (2004) Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics 20: 170-179 Smyth GK (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3: Article3 Van Dongen SM (2000) Graph clustering by flow simulation. Ph.D. University of Utrecht Vandepoele K, Van de Peer Y (2005) Exploring the plant transcriptome through phylogenetic profiling. Plant Physiol 137: 31-42 Vision TJ, Brown DG, Tanksley SD (2000) The origins of genomic duplications in Arabidopsis. Science 290: 2114-2117 Wootton JC, Federhen S (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol 266: 554-571 Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13: 555-556. Zdobnov EM, Apweiler R (2001) InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17: 847-848 Zhou H, Zhou Y (2005) SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics 21: 3615-3621