Project_desc_1002 - Shiu Lab

PROJECT DESCRIPTION
I. RESULTS FROM PRIOR NSF SUPPORT
The PI, Shin-Han Shiu is a new investigator with no prior NSF support. However, the PI has had extensive
collaborations with at least four research groups on their NSF-funded projects in the past four years:
• Richard Vierstra (NSF 2010 project; University of Wisconsin-Madison): on the evolution and functions
of E3 ubiquitin ligases including F-box proteins, and MATH-BTB proteins.
• Ming-Che Shih (NSF 2010 project; University of Iowa): on the evolution of β-glucosidase and β-galactosidase families.
• John Walker (NSF; University of Missouri): on the functional studies of plant receptor-like kinases.
• Sara E. Patterson (NSF Plant Genome Research Program; University of Wisconsin-Madison): on the
evolution of polygalacturonases and the expression divergence among A. thaliana members and on the
evolution of ethylene binding proteins.
Publications and manuscripts generated through these collaborations (*: equal contribution, personnel from
PI lab bold-faced):
• Ahn, Y. O., Zheng, M., Winkel, B., Bevan, D. R., Esen, A., Shiu, S.-H., J., B., Peng, H.-P., Miller, J.,
Cheng, C.-L., et al. Functional genomic analysis of Arabidopsis thaliana glycoside hydrolase family
35. Submitted.
• Gagne, J. M., Downes, B. P., Shiu, S. H., Durski, A. M., and Vierstra, R. D. (2002). The F-box
subunit of the SCF E3 complex is encoded by a diverse superfamily of genes in Arabidopsis. Proc
Natl Acad Sci U S A 99, 11519-11524.
• Gingerich, D. J., Hanada, K., Shiu, S.-H., and Vierstra, R. D. Identification and analysis of the Rice
BTB superfamily; evidence for a large-scale, lineage-specific expansion of an E3 ubiquitin-ligase
target recognition subunit gene family in monocots. In preparation.
• Kim, J.*, Shiu, S.-H.*, Thoma, S., Li, W.-H., and Patterson, S. E. (2006). Patterns of expansion and
expression divergence in the plant polygalacturonase gene family. Genome Biol, In press.
• Wang, W., Esch, J. J., Shiu, S.-H., Agula, H., Binder, B. M., Chang, C., Patterson, S. E., and
Bleecker, A. B. (2006). Identification of Important Regions for Ethylene Binding and Signaling in the
Transmembrane Domain of the ETR1 Ethylene Receptor of Arabidopsis. Plant Cell, In press.
II. RELEVANCE AND JUSTIFICATION
1. Overview & Objectives
The long-term goal of our research program is to understand conservation and divergence of
protein functions through the computational and evolutionary analyses of protein domain families in
eukaryotes. Currently, conserved regions representing domains in plant proteins are not well studied. As a
result, known domains in various databases can only provide a fragmentary description of plant protein
space. Since protein domains are the unit of protein evolution, a thorough description of plant domains
will not only improve annotations of plant proteins but also allow functional inference based on
evolutionary relationships between domain sequences. Although there is a large amount of plant sequence
information available, including Expressed Sequence Tags (ESTs), relatively little is known about the
functions of plant sequences. Therefore, the overall goal of this project is to improve annotations of plant
protein sequences and ESTs by: (1) identifying conserved protein domains in these sequences and (2)
transferring functional information from genes of model species such as Arabidopsis thaliana.
Specifically, we plan to identify plant protein domains, classify plant proteins and ESTs into domain
families, and use phylogenomics, that is, combining knowledge of gene function and the evolutionary
relationships between members of a domain family to infer functions of plant proteins and ESTs. We
have the following 4 aims:
1. Cluster homologous regions of plant proteins into domain families: Plant domain space is currently
not properly represented by known domains. We plan to systematically identify plant domain families
from annotated plant proteins and build statistical models of these plant domains for annotating plant
proteins and EST sequences.
2. Reconstruct domain family trees and attach functional information: To apply the phylogenomic
framework of functional inference, we will first uncover the relationships between proteins by
constructing domain sequence-based phylogenies. Three types of functional information will be
attached to the trees: ontologies, gene expression, and potential protein-protein interactors inferred
from knowledge of yeast and animal model systems.
3. Annotate plant ESTs with domain family definitions: To transfer functional information to plant
ESTs based on phylogenetic relationships, we will classify ESTs into domain families and anchor
these ESTs to functionally annotated trees (Aim 2). In addition, among ESTs that cannot be classified, we
will identify those that are likely coding sequences to generate “EST domain families” and
evaluate those that are potentially derived from RNA genes.
4. Construct the Domain database of Plant Proteins (DOPP): The above objectives will provide a
collection of novel protein domains that facilitate annotation of plant proteins and genomes,
functional annotations transferred from model plants or other eukaryotes, and established relationships
between ESTs and coding genes. A database storing the data generated will be essential for broad
dissemination to the plant research community.
The proposed activities are appropriate for the Plant Genome Research Program in building
resources that will be broadly used by the entire plant research community and in transferring knowledge
from model systems, including plant, fungi, and animals, to crops and/or economically important plant
species. We will generate plant domain models and create a comprehensive resource, DOPP, for the
research community to obtain and query plant domains. In addition, DOPP will also provide annotations
for plant proteins and ESTs with no known function by transferring ontologies, expression data, and
information on protein-protein interaction from model species. Therefore, this proposal can be considered
in the context of Tools and Resources for Plant Genome Research (TRPGR).
2. Background
As the sequences of multiple organisms become available, substantial efforts have been
devoted to associating structural domains and functional information with the sequences. The use of
protein domain information has been vital for identifying distant relationships between protein
sequences, for classification of the protein sequence space, and for functional inferences. Currently,
there are many domain databases available that allow automatic analysis of protein sequences by
identifying domain sequence signatures, such as CDD (Marchler-Bauer et al., 2005), Pfam (Bateman et
al., 2002), ProDom (Bru et al., 2005), SMART (Letunic et al., 2006), SUPERFAMILY (Madera et al.,
2004), and TIGRFAMs (Haft et al., 2003). These domain databases contain similar sets of protein
domains but are derived with somewhat different goals and target sequences. The SMART database is
mainly focused on signaling proteins. SUPERFAMILY has a comprehensive set of domain signatures
but only for proteins with known structures (Madera et al., 2004). ProDom attempts to identify all
conserved regions of sequences but with a strong bias towards prokaryote, fungal, and animal
sequences (Bru et al., 2005). TIGRFAMs mainly focuses on novel domains from prokaryotes (Bateman
and Haft, 2002). Pfam is by far the most comprehensive domain database with > 8000 entries that are
constructed with protein sequences from a wide range of organisms (Bateman et al., 2002).
Although these domain databases serve as important resources, there are several limitations on
using these databases for the annotation of plant proteins. First, some databases are constructed with an
attempt to capture remote sequence similarities between proteins from phylogenetically distant
organisms. Therefore, they tend to generate models for domains that are more conserved and miss
organism specific domains. For example, by focusing on bacterial sequences, many TIGRFAMs
models do not overlap with the general database Pfam (Bateman and Haft, 2002), indicating that the
identification of domains using relatively more closely related species will increase the protein space
coverage by domain models. The second limitation is a bias in sequence representation. There is
preferential detection of domains in sequences that are more similar to sequences used for building
domain models such as profile Hidden Markov Models (HMMs). Because fewer plant genomes have
been sequenced relative to other eukaryotes and prokaryotes, existing databases mostly use datasets
containing few plant protein sequences. For example, the Pfam HMM of Armadillo (ARM) repeat
contains 141 animal, 39 fungal, and 67 plant sequences. All of these sequences are from
genes with known functions. In our analysis of the plant ARM family, we found that the ARM model
can only describe ~50% of plant ARM repeats in 108 genes. Therefore, we ended up generating
additional HMMs for plant ARMs (Mudgil et al., 2004). The third limitation is that currently known
domains do not cover the plant protein space adequately. We found that 25-60% of predicted proteins
do not have any domain description and, of those with at least one domain, ~50% are covered by
domains over ≤ 50% of the sequence length (see Preliminary Studies). These limitations of current
databases indicate that, to better understand the functions of plant proteins, a research program
focusing on identifying plant protein domains is of utmost importance.
In addition to annotating protein sequences by protein domain, another major goal of gene
annotation is to transfer functional information from model species to sequences with no known
function. Phylogenomics, the use of evolutionary relationships to improve functional prediction of
uncharacterized genes (Eisen, 1998; Eisen and Fraser, 2003; Sjolander, 2004), has been applied to
functional annotation of gene families (Eisen and Hanawalt, 1999; Shiu et al., 2004) and whole
genomes (Eisen et al., 2002). One of the key challenges in phylogenomic studies is domain shuffling
(Sjolander, 2004). It is well established that protein domains are the units of protein evolution
(Doolittle, 1995). Protein family members diverged from common ancestors not only by changing
individual amino acids but also by fusion and fission of protein domains. As a result, it has been
advocated that the most reasonable way to cluster and classify proteins is by examining protein
domains (Liu and Rost, 2003). With a thorough description of plant domains, evolutionary relationships
between plant protein sequences can be established for functional inference. Multiple plant genome
sequencing projects are either completed or near completion. In addition, over 9 million Expressed
Sequence Tags (ESTs) are available from more than 100 plant species, including not only model
plants but also species with important agricultural, ecological, and/or evolutionary applications. With
this large amount of plant sequence information with no annotated functions, there is an ever
increasing need to consolidate and classify sequences to provide a framework for hypothesizing gene
functions based on evolutionary relationships and growing functional information from model
species.
The long-term goal in the PI’s laboratory is to understand functional divergence and
conservation among duplicate genes created via different duplication mechanisms. We have been using
evolutionary genomics approaches to determine the history of duplication and domain shuffling in
plant, animal, and fungal genes (Shiu and Bleecker, 2001; Shiu and Bleecker, 2001; Gagne et al., 2002;
Shiu and Bleecker, 2003; Shiu et al., 2004; Mudgil et al., 2004; Shiu et al., 2005; Shiu et al., 2006;
Samuel et al., 2006; Kim et al., 2006). As a result, we have extensive experience in identifying
conserved sequences as protein domains and in analyzing domain families to uncover salient features
pertaining to the evolutionary history and functional changes of genes and protein domains. Our
expertise is highly relevant to the proposed studies and will ensure the successful completion of the
project.
3. Justification
3-A. A large number of annotated proteins do not have any Pfam domains
The first question we asked is how many annotated plant protein sequences are not covered by
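Pfam domains. We queried the annotated protein sequences from six eukaryote species, including four land
plants and one green alga, against the > 8000 entries in the Pfam domain HMM collection (Table 1).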
Among plants, the coverage is the best in Arabidopsis thaliana with 74.1% of proteins covered by at least
one Pfam domain, even better than the coverage of human (69.4%). Since the Arabidopsis thaliana
genome has been regarded as finished since 2000 (Arabidopsis_Genome_Initiative, 2000), its protein
sequences have been incorporated into several domain/motif databases to serve as the representative from
plants. It should be noted that polyploidization events occur at a much higher rate in plants compared to
most other eukaryotes (Blanc et al., 2000; Vision et al., 2000; Bowers et al., 2003; Blanc and Wolfe,
2004; Cui et al., 2006). Thus, plant domains should theoretically be easier to identify since the high rate
of polyploidization contributes to a higher proportion of plant genes that are derived from recent
duplications (Lockton and Gaut, 2005). Nonetheless, 25.9% of annotated protein sequences from A.
thaliana do not have a single domain. Worse, ~50% of protein sequences from other plant species do not
have any Pfam domain.
Table 1. Domain coverage of annotated eukaryote proteins
Species                      Total (T)   Exclude (E)¹   Hit²             # of Domains³
Oryza sativa                 55800       11684          18781 (42.6%)    2612
Arabidopsis thaliana         28544       5              21139 (74.1%)    2646
Populus trichocarpa          52681       1503           25769 (50.4%)    2920
Physcomitrella patens        39793       4              16400 (41.2%)    2933
Chlamydomonas reinhardtii    11328       18             4649 (41.1%)     1685
Homo sapiens                 23692       39             16413 (69.4%)    3179
1. Domains related to transposons or retrotransposons are excluded.
2. Hit is the number of protein sequences excluding transposable elements that have at least one Pfam domain.
3. The number of unique domains identified in each organism.
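The percentages in the Hit column can be reproduced from the other columns. The following is a minimal sketch, assuming (from footnote 2) that the denominator is the total number of annotated proteins minus the excluded transposon-related entries; the dictionary and function names are illustrative only.

```python
# (total, excluded, hit) per species, copied from Table 1.
TABLE1 = {
    "Oryza sativa": (55800, 11684, 18781),
    "Arabidopsis thaliana": (28544, 5, 21139),
    "Populus trichocarpa": (52681, 1503, 25769),
    "Physcomitrella patens": (39793, 4, 16400),
    "Chlamydomonas reinhardtii": (11328, 18, 4649),
    "Homo sapiens": (23692, 39, 16413),
}


def coverage_percent(total, excluded, hit):
    """Percent of non-transposon proteins with at least one Pfam domain.

    Assumption: the denominator is total minus excluded (T - E).
    """
    return 100.0 * hit / (total - excluded)


for species, (total, excluded, hit) in TABLE1.items():
    print(f"{species}: {coverage_percent(total, excluded, hit):.1f}%")
```

Under this assumption the complement of each value is the fraction of proteins without any Pfam domain, for example 25.9% for A. thaliana.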
3-B. Many plant annotated protein sequences without Pfam domains have clear homologs and show
signatures of purifying selection
The low Pfam domain coverage in non-Arabidopsis plant species has at least two causes. The first
is that many of these plant genes are simply false positive predictions or selfish elements. For example, in
rice, the number of genes remains controversial and many annotated rice genes seem to be derived from
transposable elements (Cruveiller et al., 2004; Jabbari et al., 2004; Jabbari et al., 2004; Bennetzen et al.,
2004). Another explanation for the low coverage is that many plant conserved regions represent novel
protein domains. While it is difficult to completely exclude false positives and transposon-derived
sequences, it is expected that true coding sequences will have extensive sequence similarity and display
signatures of purifying selection. To see if some of the protein sequences without domains have related
sequences, we conducted a similarity search of protein sequences from A. thaliana, rice, and poplar with
BLAST (Altschul et al., 1997) and identified the top match to each sequence using a set of very
conservative criteria (Figure 1 legend). We found that 26896 of the 62610 sequences without Pfam
domains satisfy these criteria. Remarkably, over 79.5% of these 26896 sequences have a homolog with ≥
50% sequence identity (Figure 1A), indicating that they contain highly conserved regions that are likely
undescribed protein domains.
Figure 1. Comparisons of sequence conservation and functional constraint between annotated protein sequences
with and without Pfam domains from three plant species. (A) Cumulative total of the numbers of sequence pairs with
increasing sequence identities. The pairs are identified from an all-against-all BLAST search with annotated protein sequences
from A. thaliana, rice, and poplar as both queries and subjects. The qualified pairs were defined as those with E value ≤ 1e-5,
sequence identity ≥ 30%, aligned region ≥ 70% of the shorter sequence, and aligned region ≥ 150 amino acids. Closed circle:
sequences with at least one Pfam domain. Open circle: sequences without a Pfam domain. (B) Frequency distribution of Ka/Ks
values of sequence pairs with and without Pfam domains (black and white bars, respectively). The same sequence pairs as in (A)
were used. Ka and Ks values were determined using PAML (Yang, 1997).
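The qualified-pair criteria in the legend can be expressed as a simple filter. The following is a sketch only: the `Hit` record and its field names are hypothetical, and it assumes the aligned region is measured in amino acids as reported by BLAST.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST pair (hypothetical record; field names are illustrative)."""
    query_len: int     # query length in amino acids
    subject_len: int   # subject length in amino acids
    evalue: float      # BLAST E value
    identity: float    # percent identity over the aligned region
    aln_len: int       # aligned region length in amino acids


def is_qualified_pair(hit, max_evalue=1e-5, min_identity=30.0,
                      min_cov=0.70, min_aln_len=150):
    """Apply the four criteria listed in the Figure 1 legend."""
    shorter = min(hit.query_len, hit.subject_len)
    return (hit.evalue <= max_evalue
            and hit.identity >= min_identity
            and hit.aln_len >= min_cov * shorter
            and hit.aln_len >= min_aln_len)
```

Because both the 70% coverage and the 150 amino acid requirements must hold, sequences shorter than 150 residues can never form a qualified pair under this filter.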
Because some of the conserved sequences may represent transposons instead of genes (Bennetzen
et al., 2004), we further evaluated if the annotated sequences without domains bear the hallmark of
protein sequences that experience purifying selection (or functional constraint). The strength of functional
constraint was measured by the ratio between non-synonymous and synonymous substitution rates (Ka
and Ks, respectively; Li, 1997): the stronger the functional constraint, the lower the Ka/Ks value. In
addition, sequences that are not constrained will have a Ka/Ks value close to 1. Using the 26896
sequences identified earlier, Ka/Ks values were determined and the resulting distributions are shown in
Figure 1B. Although the Ka/Ks value distribution of annotated sequences without domains is skewed
toward slightly higher Ka/Ks values than those for annotated sequences with Pfam domains, 95% of
sequences have Ka/Ks values significantly smaller than 1 based on a likelihood ratio test (Nekrutenko et
al., 2002). This finding indicates that many of the sequences without Pfam domains are very likely
functional.
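The test of Ka/Ks against 1 can be sketched as a generic likelihood ratio test between two nested models. This is an illustration, not the cited procedure: it assumes the log-likelihoods come from two fits of the same sequence pair, one with Ka/Ks estimated freely and one with Ka/Ks fixed at 1, and that twice their difference is compared with a chi-square distribution with one degree of freedom. The function names are hypothetical.

```python
import math


def lrt_pvalue(lnl_free, lnl_fixed):
    """P-value for 2 * (lnL_free - lnL_fixed) under chi-square with 1 df.

    For one degree of freedom the chi-square survival function is
    erfc(sqrt(x / 2)), so no external statistics library is needed.
    """
    stat = max(0.0, 2.0 * (lnl_free - lnl_fixed))
    return math.erfc(math.sqrt(stat / 2.0))


def under_purifying_selection(ka_ks, lnl_free, lnl_fixed, alpha=0.05):
    """True if Ka/Ks is below 1 and the fixed (Ka/Ks = 1) model is rejected."""
    return ka_ks < 1.0 and lrt_pvalue(lnl_free, lnl_fixed) < alpha
```

A pair with Ka/Ks of 0.2 whose free model improves the log-likelihood by 10 units would be called constrained; a pair whose two models fit equally well would not, whatever its point estimate.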
It should be noted that we adopted very conservative criteria for identifying homologs that err
heavily on the side of not identifying true relatives. The requirement for over 70% length coverage and
alignment lengths of more than 150aa excludes many sequence pairs with local sequence similarity or that
share shorter motifs. Nonetheless, with these conservative criteria we still identify a large number of
annotated sequences containing conserved regions with functional constraints that should be described as
protein domains.
3-C. Significant proportions of protein sequences are not covered by Pfam domains
For proteins that have at least one Pfam domain, we asked what proportion of the
sequence length is covered by the domain(s). Among the plant proteins analyzed, each percent coverage
bin has a similar percent total number of sequences (Figure 2A). Importantly, ~50% of the protein
sequences are covered at only < 60% of their lengths by Pfam domains in all plants. Similar to sequences
without protein domains (Figure 1A), many domain-less regions in plant proteins covered by at least one
Pfam domain have readily identifiable homologous sequences. However, the functional constraints on
many of the domain-less regions are not as high as regions overlapping with known domains. As an
example, we show here the Ka/Ks values of DNA binding domains of 23 plant transcription factor
families and the domain-less regions from the same genes (Figure 2B). In all cases, the DNA binding
domains have much stronger functional constraints compared to the domain-less regions of the same
genes. This is consistent with the skew toward higher Ka/Ks values in proteins without domains
compared to those with domains (Figure 1B). These findings indicate that there is a consistent bias for
identifying regions with strong selective constraints as domains. As a result, regions that are relatively
fast evolving, although with readily identifiable homologs and significant functional constraints, tend to
be missed during the domain annotation process. Therefore, novel plant protein domains should be
identified based on the sequence conservation among plant sequences instead of conservation over
extremely long phylogenetic distances between, for example, eukaryotes and prokaryotes or plants and
animals.
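Percent length coverage, as used in this section, can be computed by merging the domain intervals on a protein so that overlapping or nested domains are not counted twice. The sketch below assumes 1-based inclusive coordinates; the function name is illustrative.

```python
def percent_length_coverage(seq_len, domains):
    """Percent of a protein covered by the union of its domain intervals.

    domains: list of (start, end) pairs, 1-based and inclusive.
    """
    covered = 0
    last_end = 0  # right-most residue already counted
    for start, end in sorted(domains):
        start = max(start, last_end + 1)  # skip residues counted earlier
        end = min(end, seq_len)           # clip to the sequence length
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / seq_len
```

For a 200-residue protein with domains at 1-50 and 40-100, the union covers residues 1-100, so the coverage is 50%.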
Figure 2. Percent length coverage of eukaryote proteins by Pfam domains and examples of accelerated evolution.
(A) Frequency distribution of the numbers of plant sequences with different percent length coverage. The caption contains
abbreviations of species from Table 1. (B) Examples of regions in protein sequences with widely different functional constraints.
Left panel: comparison of Ka/Ks values in plant transcription factor families between DNA binding domains and regions outside
of the binding domains (closed and open circles, respectively). Circles indicate the mean values and the bars indicate 95%
confidence intervals. The domain family names follow those in Pfam. Right panel: number of sequence pairs in each transcription
factor family.
III. RESEARCH PLAN
We propose to (1) cluster homologous regions of plant proteins into domain families, (2) define domain
family trees and attach functional information, (3) annotate plant ESTs with domain family definitions
and functional information, and (4) construct the Domain database of Plant Proteins (DOPP). The overall
workflow is shown in Figure 3 (next page). The experimental plans are detailed below.
Figure 3. Research plan workflow. We plan to (1-A) identify conserved regions (A-F) in annotated plant protein sequences and (1-B) build PSSMs and HMMs. The alignments for HMMs will be used to generate domain-based phylogenies (2-A). Functional
information will be associated with sequences in the trees and ancestral functions will be inferred for each node (white box).
With the domain models and phylogenies, we can then classify ESTs based on their domain contents and anchor ESTs to
phylogenetic trees for functional inference (3-A). ESTs that cannot be classified will be assessed to determine whether they belong to protein-coding
genes or are potential RNA genes (3-B).
Aim 1. Cluster homologous regions of plant proteins into domain families
The current domain descriptions for plant proteins are incomplete and of questionable sensitivity.
Therefore, our first objective is to exhaustively identify homologous regions of plant proteins resembling
protein domains and generate statistical models for query and annotation purposes.
1-A. Iterative search based on local sequence similarity
We have shown that regions in plant proteins that have no domain annotations frequently have
readily identifiable homologs but have reduced functional constraints (Figures 1 and 2). To identify
conserved regions in these potentially fast evolving sequences, it is necessary to examine relatively
closely related organisms instead of those with long phylogenetic distances. Since one of our major
goals is to enrich the domain descriptions of plant proteins, we will start with annotated protein
sequences from finished or draft quality plant genomes listed in Table 1 (referred to as the plant
protein set). Because we will establish a pipeline to automate the process of new domain identification,
additional plant genomes can be incorporated as they become available. Our overall procedure for
novel domain identification is similar to the automated clustering pipeline used by the ProDom database
(Bru et al., 2005) with the following steps:
1. Mask low complexity regions using the program SEG (Wootton and Federhen, 1996).
2. Identify the shortest sequence in the plant protein set and break it down further if there are
internal repeats. Treat the first repeat as the shortest sequence.
3. Use the shortest sequence to conduct a PSI-BLAST (Altschul et al., 1997) search against the
plant protein set. PSI-BLAST is an iterative search program that identifies related sequences
by generating a Position-Specific Score Matrix (PSSM) in each iteration to improve search
sensitivity. PSI-BLAST will be run until convergence or stopped after 10 iterations.
4. Treat all regions identified in the iterative search as a new “domain family” and update the
database by removing these regions.
5. Repeat steps 2-4 till no sequence is left.
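The control flow of steps 2-5 can be sketched as a greedy loop. This is an illustration under stated assumptions, not the proposed pipeline: `find_hits` is a toy stand-in for the PSI-BLAST search that only reports exact substring matches, step 1 (SEG masking) and internal-repeat handling are omitted, coordinates are 0-based and half-open, and the `min_len` parameter for discarding very short leftover fragments is hypothetical.

```python
def find_hits(query, pool):
    """Toy stand-in for PSI-BLAST: first exact match of `query` per fragment."""
    hits = []
    for frag_id, frag in pool.items():
        pos = frag.find(query)
        if pos != -1:
            hits.append((frag_id, pos, pos + len(query)))
    return hits


def cluster_domain_families(proteins, search=find_hits, min_len=3):
    """Greedy clustering: shortest remaining sequence seeds each family.

    proteins: dict mapping protein name to sequence.
    Returns a list of families; each family is a sorted list of
    (protein name, start, end) regions.
    """
    # Fragments are keyed by (protein name, offset within the protein).
    pool = {(name, 0): seq for name, seq in proteins.items()
            if len(seq) >= min_len}
    families = []
    while pool:                                   # step 5: repeat until empty
        # Step 2: pick the shortest remaining sequence as the query.
        query_id = min(pool, key=lambda k: (len(pool[k]), k))
        query = pool[query_id]
        # Step 3: search the remaining set with the query.
        hits = search(query, pool)
        if not any(h[0] == query_id for h in hits):
            hits.append((query_id, 0, len(query)))  # guarantees progress
        # Step 4: record the family and remove the matched regions.
        family = []
        for (name, offset), start, end in hits:
            frag = pool.pop((name, offset))
            family.append((name, offset + start, offset + end))
            left, right = frag[:start], frag[end:]
            if len(left) >= min_len:
                pool[(name, offset)] = left
            if len(right) >= min_len:
                pool[(name, offset + end)] = right
        families.append(sorted(family))
    return families
```

With the toy input `{"p1": "AAAKKKCCC", "p2": "KKK", "p3": "CCCGGG"}`, the shortest sequence `KKK` seeds the first family and splits `p1` into two flanking fragments, which then seed or join later families.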
Despite the similarity with ProDom procedures, we plan to implement additional quality control
measures to ensure the domains identified are as complete and accurate as possible. The reason for
starting with the shortest sequence is based on the assumption that the shortest sequence corresponds to
a domain. Since we cannot be sure all annotations are correct and pseudogenes may be present in the
plant protein set, the shortest sequence (Sq) may be only part of a complete domain. Therefore, we will
include an additional step between steps 3 and 4 to check whether the PSI-BLAST query sequence has missing
coding sequences. To accomplish this, we will use full length protein sequences of the PSI-BLAST hits
as queries to conduct translated similarity searches against the genomic region containing Sq. If missing
coding sequences are found (identity of the translated genomic region neighboring Sq ≥ identity
between Sq and its top match minus 5%), Sq will be treated as a mis-annotated sequence and removed
from the database. The pipeline will restart at step 2.
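The decision rule for flagging a mis-annotated query can be written as a one-line comparison. The sketch below assumes that "minus 5%" means five percentage points of sequence identity; the function and parameter names are illustrative.

```python
def has_missing_coding_sequence(neighbor_identity, top_match_identity,
                                tolerance=5.0):
    """Decide whether the region flanking Sq looks like missed coding sequence.

    neighbor_identity: percent identity of the translated genomic region
        neighboring Sq to the full-length hit.
    top_match_identity: percent identity between Sq and its top match.
    tolerance: allowed drop in identity, assumed to be percentage points.
    """
    return neighbor_identity >= top_match_identity - tolerance
```

For example, if Sq matches its top hit at 65% identity, a flanking region translated at 62% identity would flag Sq for removal, while one at 40% would not.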
Another potential problem associated with the PSI-BLAST search is “profile wandering” where
false positive sequences are included and true positives are excluded (George and Heringa, 2002). To
reduce profile wandering, we will use a rather stringent E-value threshold (1e-5) for PSI-BLAST
searches. However, because of this stringent threshold setting, it is expected that sequences that are
supposed to be in one domain family will be broken into multiple domain families. To identify
relationships between domain families, we will cluster the domain families together into superfamilies
as detailed in Aim 1-B. Finally, some predicted genes from complex plant genomes are likely false
positives and represent full-length transposable elements or their remnants (Bennetzen et al., 2004). Therefore, it
is anticipated that some of the domain families identified may be derived from transposons or
retrotransposons.
1-B. Populating statistical models for domain families
The goals in this section are to generate statistical models for identified domains, to determine the
overlap between novel plant domains and those that are already known, and to group related domains
together into domain “superfamilies”.
(1). Domain models: Note that in the domain search pipeline we will not exclude known
domains. The inclusion of regions that belong to known domains will allow us to evaluate the efficiency
of the pipeline and to determine the degree of domain fragmentation. Another reason to include known
domains is to generate domain models that are more plant-specific. Nearly all domain databases focus on
conservation over long evolutionary distance or have significantly more non-plant sequences when
training domain statistical models. Thus, some of the models are not as sensitive for detecting plant
sequences. At the end of PSI-BLAST searches, a PSSM (Henikoff and Henikoff, 1997) describing the
probability distributions of amino acids in an alignment will be generated for each domain family. The
PSSMs have been used for identifying conserved domains/motifs in resources such as the NCBI
Conserved Domain Database (Marchler-Bauer et al., 2005). In addition to the PSSM, another commonly used
statistical model for describing domain sequences is the HMM (Eddy, 1998). Since PSSMs do not
necessarily outperform HMMs (Delorenzi and Speed, 2002), we will also generate an HMM for each
domain by (1) generating a multiple sequence alignment (MSA) of domain family members (details in
Aim 2-A) and (2) building a model of the MSA with HMMER (http://hmmer.janelia.org/). For both the
PSSM and HMM of a domain family, the cutoff scores will be defined by comparing the score
distributions of the true positives (members of the domain family) and random amino acid sequences with
the same composition and length distribution as members of the domain family in question. The cutoff
score will be the lower of two values: the score of the lowest-scoring true positive and the score at a 5% false
positive rate. These domain models will be referred to as the “plant domain” set.
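The cutoff rule can be sketched as follows. It assumes that a sequence is called a member when its score is strictly above the cutoff, and that the score at a given false positive rate is read off the empirical distribution of random-sequence scores; function names are illustrative.

```python
def score_at_fpr(random_scores, fpr=0.05):
    """Threshold that at most `fpr` of the random scores strictly exceed."""
    ranked = sorted(random_scores, reverse=True)
    allowed = int(fpr * len(ranked))        # random scores allowed above it
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]


def cutoff_score(true_positive_scores, random_scores, fpr=0.05):
    """Lower of: the lowest true-positive score, the score at `fpr`."""
    return min(min(true_positive_scores), score_at_fpr(random_scores, fpr))
```

Taking the lower of the two values favors sensitivity: every family member is retained whenever the family scores overlap the random distribution, at the cost of a false positive rate that can then exceed 5%.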
(2) Determine overlap with known domains and anneal fragmented Plant Domains: The plant
domain set will be used to search against the plant protein set. Meanwhile, we will also use InterProScan
to annotate the plant protein set with InterPro domains (Zdobnov and Apweiler, 2001). InterPro compiles
domain information from various databases and is the most comprehensive protein domain description
available. Here we will examine regions with both plant domain and InterPro annotations to evaluate the
overlap in sequence coverage and to establish relationships between InterPro and plant domains for future
reference. In addition, it is known that domains identified through PSI-BLAST-based approaches tend to
be fragmentary (Liu and Rost, 2003). Different plant domains that are nested within an InterPro
domain will be regarded as a “domain contig”. Plant domains that do not overlap with InterPro domains
will be defined as being in the same domain contig if all plant protein sequences of the involved
domain families have the same domain composition. New PSSMs and HMMs will be generated for these
domain contigs that better represent the classical definition of protein domains.
(3) Overlap among Plant Domain models: Because a relatively stringent cutoff will be used for
identifying related sequences in the iterated similarity searches, it is anticipated that the Plant Domain set
will contain distinct entries that are related. To identify the relationships between plant domain entries, we
will conduct an all-against-all BLAST search with all domain sequences and use transformed E-values
to generate a similarity matrix for Markov Clustering (MCL; Van Dongen, 2000). Domain families in
the same cluster are regarded as members of a domain “superfamily”. New PSSMs and HMMs will be
generated for these domain superfamilies.
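Preparing the similarity matrix for MCL can be sketched as below. The text does not specify the E-value transformation, so this example assumes a negative base-10 logarithm with a cap for E-values reported as zero; the E-value filter and the cap value are likewise assumptions. The output is a list of tab-separated label pairs with weights, the form MCL accepts with its `--abc` option.

```python
import math


def evalue_to_weight(evalue, cap=200.0):
    """Transform an E-value into a similarity weight (assumed: -log10, capped)."""
    if evalue <= 0.0:
        return cap  # BLAST reports 0.0 for extremely small E-values
    return min(cap, max(0.0, -math.log10(evalue)))


def similarity_edges(blast_hits, max_evalue=1e-5):
    """Build symmetric edge weights from (query, subject, evalue) tuples.

    Self hits are dropped; reciprocal hits keep the larger weight.
    """
    edges = {}
    for query, subject, evalue in blast_hits:
        if query == subject or evalue > max_evalue:
            continue
        pair = tuple(sorted((query, subject)))
        weight = evalue_to_weight(evalue)
        edges[pair] = max(weight, edges.get(pair, 0.0))
    return edges


def to_abc_lines(edges):
    """Format edges as 'labelA<TAB>labelB<TAB>weight' lines."""
    return [f"{a}\t{b}\t{w:.2f}" for (a, b), w in sorted(edges.items())]
```

Keeping the larger of the two reciprocal weights makes the matrix symmetric; averaging them would be an equally reasonable choice.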
After these quality control steps, the PSSMs and HMMs generated will be used for annotating
both plant proteins and ESTs (Aim 3) and will be available from the web interface, DOPP (Aim 4).
Aim 2. Reconstruct domain family trees and attach functional information
Domain sequences better approximate the units of protein evolution in the context of domain shuffling.
To provide a phylogenomic framework for further analyses of the domain evolutionary histories and for
inference of functions with related sequences, we plan to build trees for each domain family and anchor
three types of functional information onto the trees from model species.
2-A. Domain-based trees
(1) Multiple sequence alignments: For each domain and domain contig (a set of domains that are
common among all domain family members), a multiple sequence alignment will be generated using
three programs: MAFFT (Katoh et al., 2005), SPEM (Zhou and Zhou, 2005) and ProbCons (Do et al.,
2005). All three programs implement progressive alignment and iterative refinement algorithms that are
more accurate than other alignment software (Zhou and Zhou, 2005; Edgar and Batzoglou, 2006).
Depending on the benchmark set and level of sequence similarity, however, one may out-perform the
other two. We will use all three programs to account for the heterogeneity inherent to our dataset (in
sequence lengths, numbers, and similarities). For each domain, the alignments generated by each method
will be further refined with the block-multiple alignment refinement program, BMArefiner (Chakrabarti
et al., 2006). BMArefiner implements objective functions for determining alignment quality
scores. The refined alignment with the highest quality score will be used for building HMMs (Aim 1-B)
and phylogenetic reconstruction in the following section.
(2) Phylogenetic inference: Two methods will be used to generate phylogenetic trees. The first is
neighbor-joining (NJ; Saitou and Nei, 1987). For each domain family, an NJ tree will be generated with
1000 bootstrap replicates where multiple substitutions are corrected using the Poisson distance and
alignment gaps are treated as missing characters. The second approach is maximum parsimony (MP).
Branch and bound tree searches will be performed based on the alignments of each domain family.
The parsimony ratchet algorithm will be used to iteratively reduce tree space until the shortest trees are found
(Nixon, 1999). A strict consensus tree will be generated if multiple equally parsimonious trees are
uncovered. The phylogenies generated will be integrated into the database (Aim 4) and will be used for
functional inference of each tree node (Aim 2-B) and plant ESTs (Aim 3-A).
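The distance calculation that would feed the NJ step can be sketched as follows. This is a minimal illustration, assuming the standard Poisson correction d = -ln(1 - p) and pairwise exclusion of gapped sites as one way of treating gaps as missing characters. The function names are hypothetical, and tree building and bootstrapping themselves would be left to an established phylogenetics package.

```python
import math

# Characters treated as missing data (an assumption for this sketch).
MISSING = {"-", "?"}


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance between two aligned protein sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a in MISSING or b in MISSING:
            continue  # gapped sites are ignored for this pair only
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    p = differences / compared  # proportion of differing sites
    if p >= 1.0:
        return float("inf")  # correction is undefined when every site differs
    return -math.log(1.0 - p)


def distance_matrix(alignment):
    """Pairwise distances for a dict of name -> aligned sequence."""
    names = sorted(alignment)
    return {
        (x, y): poisson_distance(alignment[x], alignment[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

The resulting matrix is the input an NJ implementation expects; bootstrap replicates would resample alignment columns before calling `distance_matrix` again.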
Evolutionary history of genes is not necessarily tree-like because domain fusion/fission events
and gene conversion occur at a significant rate. In this context, the relationships between genes are
better captured in the context of domain families. Since most genes will have more than one domain,
we will treat the phylogenies for each domain family as multiple hypotheses for the relationships
between genes containing the domain in question, instead of just picking one tree. One potential
problem for domain-based trees is that some domains are too short and contain insufficient informative
characters for phylogenetic reconstruction. Another difficulty is presented by repeated domain families
with different repeat numbers among family members. Therefore, we will also use full length protein
sequences of members of a domain family to build gene trees to complement domain trees. Again, the full
length sequence tree will be treated as one hypothesis for the possible relationships between members of a
family.
2-B. Association of functional information with phylogenies
The goal in this section is to anchor functional information onto the phylogenetic trees for
functional inferences based on evolutionary relationships. In this phylogenomic framework, functions
of genes can be uncovered by examining functional data of genes from related model species. In
addition to attaching functional information to the trees, we will also infer the ancestral functions at
each node of the trees by the parsimony method (Figure 4A). This approach will provide a platform for
functional predictions of any sequence residing in any clade defined by its subtending node. The types
of functional information and how they will be incorporated into the trees are described below.
Figure 4. Inference of ancestral functions and physical interactions. (A) Inference of ancestral function. The example tree has
4 genes, g1-g4. Suppose there are 3 functions associated with these 4 genes. Each gene either performs a function (1) or not (0).
The ancestral functional state at nodes a, b, and c can be inferred using the parsimony principle, where the number of functional
gain and loss events is minimized. For example, we can infer that the common ancestor of g1 and g2 may or may not have had function A
but had functions B and C. The arrow indicates the anchorage point of an EST unipeptide (ESTx). The putative functions of the
unipeptide will be those of node c. (B) Interaction inference. Protein a1 interacts with a2-a4 and has orthologous relationships
with plant proteins p1 and p2. If p3-5 are orthologous to any of a2-4, then p3-5 are regarded as putative interactors of p1-2.
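The parsimony inference of ancestral function illustrated in Figure 4A can be sketched with the bottom-up pass of Fitch parsimony for a single presence/absence character. This is an illustration under stated assumptions, not the procedure that will necessarily be used: the tree topology and the gene states below are hypothetical, the tree is assumed to be strictly bifurcating, and only the first (bottom-up) pass is shown, so ambiguous sets may be narrowed by a subsequent top-down pass.

```python
def fitch_down_pass(tree, leaf_states, node_sets=None):
    """Bottom-up Fitch pass for one binary character (1 = has function, 0 = lacks it).

    `tree` is a leaf name (str) or a 2-tuple of subtrees; `leaf_states` maps
    leaf name -> 0 or 1. Returns (state set of `tree`, dict of internal node sets).
    """
    if node_sets is None:
        node_sets = {}
    if isinstance(tree, str):
        return {leaf_states[tree]}, node_sets
    left, _ = fitch_down_pass(tree[0], leaf_states, node_sets)
    right, _ = fitch_down_pass(tree[1], leaf_states, node_sets)
    shared = left & right
    # Use the intersection when the children agree on some state; otherwise
    # take the union, which implies one gain or loss event below this node.
    node_sets[tree] = shared if shared else left | right
    return node_sets[tree], node_sets


# Hypothetical topology and states, not those drawn in Figure 4A.
tree = (("g1", "g2"), ("g3", "g4"))
function_a = {"g1": 1, "g2": 0, "g3": 0, "g4": 0}
root_set, node_sets = fitch_down_pass(tree, function_a)
print(node_sets[("g1", "g2")])  # both states remain possible at this pass
print(root_set)
```

A node whose set contains both 0 and 1 corresponds to the "may or may not have the function" case in the caption. Running the function once per function (A, B, C) gives the per-node functional profile.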
(1) Gene Ontology & Plant Ontology: Three types of functional data will be incorporated into
the trees. The first type is Gene Ontology (GO) and Plant Ontology (PO) annotations. The GO project is a community-based effort to generate and use ontologies to facilitate the biologically meaningful annotation of genes
and their products in a wide variety of organisms (Ashburner et al., 2000). For plants, both the TAIR and
Gramene databases are members of the GO consortium, contributing gene annotations of A. thaliana
and grain species including rice. The PO consortium (Pujar et al., 2006), on the other hand, develops
ontologies describing plant anatomy and growth and developmental stages for gene annotation in
angiosperms. Both GO and PO annotations will be associated with gene domains. For GO, because of
the high error rate in categories that are not annotated based on experiments, we will only use
categories with an experimental evidence code: inferred from direct assay (IDA), mutant phenotype (IMP), expression
pattern (IEP), genetic interaction (IGI), or physical interaction (IPI).
(2) Expression data: Plant microarray data continue to grow quickly, and more than
200 different experiments are available for A. thaliana in ArrayExpress (Parkinson et al., 2005). These
expression data can provide clues to the temporal and spatial distribution of transcripts for genes with
no known expression patterns. Instead of dealing with the equivalence between experiments from the
same plant or different plants, each experiment will be treated as an independent dataset. The
expression data will be associated with genes in two different ways. First, instead of simply distinguishing
a gene as expressed or not expressed, we will make use of the expression level information by treating
expression as continuous character states and inferring ancestral conditions using Mesquite (Maddison
and Maddison, 2006). The second approach is to identify differentially expressed genes and assign up-regulated genes character state 1, down-regulated ones -1, and genes without significant
changes 0. This approach will only be applied to stress-related datasets. For each stress-related
dataset, control and treatment groups (with each time point scored independently) will be normalized
and differential expression of genes will be detected using LIMMA (Smyth, 2004). Ancestral
expression patterns will be inferred based on parsimony.
(3) Interaction data: The protein-protein interaction data have increased substantially over the
past several years due to efforts of individual labs and several large scale interactome projects for yeast,
fly, C. elegans, and human (summarized in Gandhi et al., 2006). However, plant protein interaction data
remain scarce. Assuming that functional protein interactions are conserved in evolution, hypotheses
regarding the potential interaction partners in plants can be extended by mining data from the model
organism protein interaction datasets. Such interaction inference is conceptually identical to
phylogenomic inference and has been applied to inferring human protein interactions (Matthews et al.,
2001; Kemmer et al., 2005). To infer plant protein interactions, we will first identify orthologous
groups between plant proteins and their fungal or animal counterparts with a tree-based approach (Shiu et
al., 2006). Then the putative interacting proteins will be mapped through the inferred orthologous relationships
(Figure 4B) based on the interaction data from the Database of Interacting Proteins (DIP; Salwinski et
al., 2004). Note that plants diverged from fungi and animals long ago. Extreme lineage-specific
expansion that occurred in plants and/or fungi/animals (for example, the RLK/Pelle family with 1
member in fly and >600 in A. thaliana; Shiu and Bleecker, 2003) makes the inference less meaningful.
Therefore, the size of the orthologous group will be provided along with the predictions.
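The interaction inference in (3) and Figure 4B can be sketched as a projection of model-organism interactions through orthologous groups. This is a minimal sketch with hypothetical names and data structures: orthologous groups are assumed to be already available as a mapping from each model-organism protein to its plant orthologs, and the number of plant orthologs on each side is reported as a simple stand-in for orthologous group size.

```python
from itertools import product


def infer_plant_interactions(model_interactions, plant_orthologs):
    """Project model-organism interactions onto plant proteins via orthology.

    model_interactions: iterable of (protein_a, protein_b) pairs.
    plant_orthologs: dict mapping a model-organism protein to the set of
        plant proteins in its orthologous group.
    Returns a dict keyed by a sorted plant protein pair; each value lists the
    supporting model interactions with the plant ortholog counts on each side.
    """
    predictions = {}
    for a, b in model_interactions:
        plants_a = plant_orthologs.get(a, set())
        plants_b = plant_orthologs.get(b, set())
        for p, q in product(sorted(plants_a), sorted(plants_b)):
            if p == q:
                continue  # self-pairs are skipped; a simplification
            pair = tuple(sorted((p, q)))
            predictions.setdefault(pair, []).append(
                {
                    "source": (a, b),
                    # Large counts flag lineage-specific expansions, where
                    # the projected interaction is less reliable.
                    "plant_group_sizes": (len(plants_a), len(plants_b)),
                }
            )
    return predictions


# Toy data following the naming in Figure 4B (a = model organism, p = plant).
orthologs = {"a1": {"p1", "p2"}, "a2": {"p3"}, "a3": {"p4"}, "a4": {"p5"}}
interactions = [("a1", "a2"), ("a1", "a3"), ("a1", "a4")]
putative = infer_plant_interactions(interactions, orthologs)
```

Each predicted pair keeps its supporting evidence, so a prediction backed by a 1-to-600 orthologous group can be down-weighted or filtered by the user.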
After the function assignment and ancestral function inference are completed, the phylogenies
with functional information will be used to address several outstanding questions concerning the fate of
duplicate genes, including (1) what types of domains tend to expand independently in various lineages
of land plants, (2) how fast do novel functions arise after gene duplication, and (3) what is the
predominant fate of duplicate genes, subfunctionalization or neofunctionalization? The functionally
annotated phylogenies will also be used to infer the functions of ESTs, as outlined in Aim 3.
Aim 3. Annotate plant ESTs with domain family definitions
Establishing the relationships between ESTs and protein sequences from model plant species is an
important step for generating hypotheses concerning EST functions. Therefore, we plan to anchor ESTs to
domain family trees for functional inferences. Because ESTs are partial transcripts of both protein coding
and RNA genes, we also plan to examine the coding potential and sequence conservation of ESTs that
cannot be classified to evaluate their potential to be RNA genes.
3-A. EST classification based on domain families
(1) Assignment to domain families: Substantial efforts have been devoted to identifying sequence
similarity between ESTs and plant proteins in NSF funded resources such as Phytome (Hartmann et al.,
2006). For mapping ESTs to domain families, we will take the unipeptides (hypothetical protein
sequences of ESTs) from Phytome to search against the plant domain set constructed in Aim 1-B with
HMMer. Some EST unipeptides will be readily classified into domain families. However, since ESTs and
even EST contigs are mostly partial transcripts, they may miss part of an otherwise intact plant domain. To
reduce the number of unassigned regions in ESTs or EST contigs, the unipeptides will also be used to
search the plant protein set (defined in Aim 1-A) with BLAST to determine if any unipeptide maps to
only part of a domain in the top match plant protein. If so, these fragmented domains in ESTs will be
indicated but not processed any further. Unipeptides rely on proper assembly of ESTs into contigs. To
resolve any ambiguity in domain assignments due to misassembly or problems in identifying translated
regions, we will also take individual ESTs to search the plant protein set. The coordinates of matching
areas will be checked against the EST contig assembly information and domains in the matching plant
protein. Any discrepancy will be flagged and the EST contigs will be broken into smaller contigs to
ensure correct domain family assignment. New unipeptides derived from these smaller contigs will be
extracted for domain identification procedures mentioned above.
(2) Anchoring onto functionally annotated trees: Once unipeptides have been mapped to
domain families, the next phase will be anchoring unipeptides to the domain trees. For a domain sequence
with a unipeptide, its top match will be identified among annotated plant protein members of the domain
family. Then an iterative searching algorithm will be applied to identify a monophyletic group where the
maximum distance within the group is between the unipeptide and its most closely related plant protein
sequence (Shiu et al., 2005). In the trees of the domain family the unipeptide belongs to, the branch that
leads to the clade with all members of the monophyletic group is the anchorage point (Figure 4A, arrow),
and the putative function of unipeptides will be inferred from the node subtending the anchorage point.
3-B. Assessment of properties of ESTs that cannot be assigned
It is known that many ESTs do not have obvious sequence similarity to annotated protein
sequences (Vandepoele and Van de Peer, 2005) and will not be assigned to any domain family. We
found that some of these ESTs likely contain novel protein coding genes (Hanada et al., submitted). In
addition, several important studies have revealed the importance of RNA genes in many eukaryotes
including plants (Meyers et al., 2006). There are at least two reasons why it is important to distinguish
ESTs with coding sequences from those derived from RNA genes. First, since many more plant species have
ESTs than have sequenced genomes, many plant-specific domains will likely be identified from
EST datasets. In addition, a way to distinguish protein coding from RNA genes will better guide the
experimental efforts for understanding gene functions using ESTs.
(1) Coding potential of non-assigned ESTs: We have developed a simple but very efficient
measure of coding potential called the Coding Index (CI; Hanada & Shiu, submitted), based on the differences in the nucleotide
compositions of coding and non-coding sequences. We will use this
method to evaluate whether an EST contains coding sequence and to extract the putative coding sequences for
further comparative analysis. For each species, we will use introns+UTRs and the coding sequences of
published genes as training sets for non-coding and coding regions, respectively. The posterior
probability that a sequence is coding is calculated with an implementation of Bayes’ theorem
incorporating nucleotide composition information of both coding and non-coding training sequences.
Regions that are ≥ 20 aa and have a CI value higher than 95% of non-coding training sequences are
regarded as putative coding sequences. Because EST sequences tend to be of lower quality with higher
error rates, putative coding regions will be joined if they are separated by 1-2 bases in the
same orientation. All EST regions with significant CI values will be verified in (2).
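The general idea of a composition-based posterior can be illustrated as follows. This is not the Coding Index itself, whose details are in the submitted manuscript; it is a simplified sketch that assumes single-nucleotide frequencies, independence between positions, a pseudocount of 1, and equal priors for coding and non-coding sequence. All function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def base_frequencies(sequences, pseudocount=1.0):
    """Estimate nucleotide frequencies from a set of training sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(b for b in seq.upper() if b in BASES)
    total = sum(counts[b] for b in BASES) + len(BASES) * pseudocount
    return {b: (counts[b] + pseudocount) / total for b in BASES}


def coding_posterior(seq, coding_freq, noncoding_freq, prior_coding=0.5):
    """Posterior probability that `seq` is coding, by Bayes' theorem.

    Likelihoods are products of per-base frequencies (computed in log space);
    ambiguous bases such as N are ignored.
    """
    log_coding = math.log(prior_coding)
    log_noncoding = math.log(1.0 - prior_coding)
    for b in seq.upper():
        if b in BASES:
            log_coding += math.log(coding_freq[b])
            log_noncoding += math.log(noncoding_freq[b])
    diff = log_noncoding - log_coding
    if diff > 700:  # avoid overflow in exp(); the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))
```

In use, `base_frequencies` would be called once on the coding training set and once on the introns+UTRs training set for each species, and `coding_posterior` would then score each candidate EST region.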
(2) Relatives of putative coding sequences: To assess whether the putative coding sequences are
subject to selective constraints at their non-synonymous sites, the Ka/Ks value will be determined for
each putative coding sequence and its top match and tested for whether it is significantly less than one
(Nekrutenko et al., 2002), which would indicate functional constraints at the coding sequence level. The top match
is identified within the putative coding sequence set with an identity threshold of 30% and an
alignment length threshold of 20 aa. Sequences with functional constraints will then be used for domain
identification using the pipeline outlined in Aim 1-A. The domains identified from these EST putative
coding sequences with evidence of functional constraints will be referred to as the plant EST domain
set. We will not build phylogenies for these EST domains since the main purpose of the phylogenies is
for functional inferences.
(3) Putative RNA genes: ESTs that do not contain putative coding sequences are likely derived
from RNA genes or untranslated regions (UTRs) of protein coding transcripts. We will only attempt to
distinguish RNA genes from UTRs in A. thaliana and rice since both species have sequenced genomes
and a substantial number of full length cDNAs. If the ESTs are within a threshold distance of known
coding regions, they will be regarded as UTRs. The threshold distance will be the length that
95% of the 5’ or 3’ UTRs do not exceed, based on full length cDNAs. Since the main goal of the
proposed studies is to annotate plant protein space, we will explore only two aspects of these
potential RNA genes. First, we are interested in determining if these putative RNA genes are also
identified by experimental methods (Lu et al., 2005). Second, we are interested in knowing if
some of these putative RNA genes form gene families and if related sequences show functional
constraints that can be evaluated by comparing their substitution rates to those of introns and intergenic
sequences. These preliminary studies will bring novel insights regarding the evolution of RNA genes,
which is poorly understood at this point.
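The UTR distance threshold described in (3) can be read as a 95th percentile of known UTR lengths. The sketch below follows that reading, which is an interpretation rather than a confirmed specification; it uses the nearest-rank percentile definition, and the function names and the two class labels are hypothetical.

```python
def utr_distance_threshold(utr_lengths, percent=95):
    """Nearest-rank percentile of UTR lengths (in nucleotides).

    `utr_lengths` would come from full length cDNAs, computed separately
    for 5' and 3' UTRs.
    """
    if not utr_lengths:
        raise ValueError("at least one UTR length is required")
    ordered = sorted(utr_lengths)
    # Integer ceiling of percent/100 * n, avoiding floating-point rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def classify_noncoding_est(distance_to_coding_region, threshold):
    """Label an EST without coding potential by its distance to the nearest
    annotated coding region on the same strand."""
    if distance_to_coding_region <= threshold:
        return "UTR"
    return "putative RNA gene"
```

For example, with UTR lengths of 1 through 100 nucleotides the threshold is 95, so an EST 80 nucleotides from a coding region would be labeled a UTR and one 300 nucleotides away a putative RNA gene.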
Aim 4. Construct the Domain database of Plant Proteins (DOPP)
The information generated from the first three aims will be very useful to the bioinformatics teams of
plant genome annotation projects and to individual investigators interested in finding the domain
content, evolutionary relationships, or functional inferences of their sequences of interest. Therefore,
our final goal is to construct a database, DOPP, for depositing the large amount of data generated by
the proposed studies, together with a query interface for browsing and searching the content of the database.
Below we provide details on the data types that will be available from DOPP, the web interface, and
the implementation plan.
(1) Data types: The proposed project will generate plant domain sequences (Aim 1-B),
statistical models for domains and EST domains (Aim 1-B & Aim 3-B), sequence alignments (Aim 2-A), phylogenies with functional annotations (Aim 2-B), EST/unipeptide mapping information (Aim
3-A), putative coding sequences in ESTs that cannot be mapped to the trees (Aim 3-B), and ESTs
potentially derived from RNA genes (Aim 3-B).
(2) Interface design: DOPP will be constructed in a way that is similar to the Pfam interface
with a major difference in that DOPP will incorporate phylogenetic trees and EST information. Users
can retrieve all data types in bulk for large scale analyses or annotations by providing a list of names,
by specifying the organisms, or by specifying the data types.
In addition, users can find the entry of interest by entering genome database specific gene names,
GenBank/EMBL/SWISSPROT accession numbers, or sequences. If the query sequence or keyword
entered matches what is in the database, they will be able to follow the entry identified to get
information on its domain content, placement on phylogenetic trees, inferred functions, and related
sequences. On the other hand, if the query sequence is not in the database, it will be queried against
domain models with HMMer and reverse PSI-BLAST (for protein sequences) or Wise2 (for nucleotide
sequences) and the sequence will be anchored to phylogenies as described in Aim 3-A.
(3) Implementation plan: The implementation of DOPP will be assisted by the Research
Technology Support Facility (RTSF) at Michigan State University (see Support Letter). RTSF has
dedicated personnel and expertise for web interface design and implementation for large-scale
genomics projects. Several of these projects are NSF funded activities (e.g. Galdieria genome
sequencing and Plastid 2010 project). Because DOPP will provide both pre-computed results and a
query interface for onsite computation, the website will be hosted on a three-node cluster with a storage
node for handling data storage and database queries and two computing nodes for onsite analysis. We
will use the Apache web server for the website and MySQL for the database. Data will be backed up
weekly and the cluster will be managed by a system administrator in the Plant Biology Department. We
plan to maintain the database past the funding period by giving our analysis pipelines a
modular design, so that components can be changed with minimal changes to our code. We also plan to
contribute the DOPP models to the InterPro database, which is actively maintained and broadly used.
IV. INTEGRATION OF RESEARCH AND EDUCATION
1. Research Opportunities on Computational Genomics for High School and
Undergraduate Students
The proposed project directly involves training of graduate students and postdoctoral researchers in the
field of computational genomics. In addition to researchers that are directly involved in the project, the PI
has established collaborations with two organizations on the Michigan State University campus that will
provide training opportunities to high school and undergraduate students.
The PI is involved in the High School Honors Science, Mathematics & Engineering Program
(HSHSP, www.msu.edu/~hshsp/). HSHSP is a program that recruits high school sophomore or junior
students to the MSU campus to work in a research laboratory. During the summer of 2006, an HSHSP
student worked on the identification of mis-annotated Arabidopsis thaliana genes using missing protein
domains as the criterion. The student was able to learn programming in two weeks and at the same time
gain an understanding of the biological and computational problems involved in the project. In the
ensuing six weeks, he worked on the project and gave a presentation and a report at the end. Currently we
are finalizing the findings for a manuscript. During the funding period, we plan to recruit two high school
students from HSHSP per year to work on (1) the coverage of plant domains in plant proteins, (2) the
degree of functional constraint on different domains, (3) transposable element related domains, and (4)
the rate of domain shuffling in plant proteins.
In addition, the PI is involved in the Research Training Program for Undergraduate Students in
Biological and Mathematical Sciences (UBM). The mission of UBM is to provide intense training at the
intersection of mathematical and biological sciences, including interdisciplinary educational opportunities
and research experiences. Students in the program are mostly majoring in quantitative sciences such as
Mathematics, Statistics, and Computer Science. Given the importance of interdisciplinary research in
biology, it is important to attract students with strong quantitative backgrounds to work on plant biology
problems. During the funding period, we plan to recruit two students at any given time from this program
to work on the proposed project. Specifically, the students will focus on (1) simulation studies on amino
acid sequence evolution in protein domains, (2) domain network analysis, and (3) machine learning
problems in combining heterogeneous functional information for annotations.
2. Partnership with the East Lansing Public Library
The PI has established a partnership with Sylvia Marabate (Director), Julie Pierce (Head of Adult and
Children’s Services), and Mary Hennessey (Young Adult Coordinator) of the East Lansing Public
Library (ELPL) to develop activities aiming to enhance the general public’s understanding of science,
evolution, and genomics using the proposed research project as an example. The importance of this
type of partnership is well supported by the conclusions of a recent NSF publication on the public
attitudes toward science and technology (National Science Board, 2004). Two-thirds of Americans do
not clearly understand the scientific process. Nearly 150 years after the Origin of Species, evolution is
still under attack, and only 53% of Americans believe that human beings developed from earlier species
of animals. The general public is largely unaware of new developments in science, such as genomics.
These statistics suggest that an outreach program focused on the process of science, facts on
evolution, and the prospects of genomics will be an important effort to improve science literacy. In our
partnership, the PI’s lab will provide scientific expertise and ELPL will devote personnel for event
planning and space. ELPL has extensive experience in hosting outreach programs for all age groups
and in attracting a broad audience in central Michigan. Since all current programs in ELPL focus on
literature, theater, and fine arts, our planned science program will be a unique opportunity to educate
the public about science. The ELPL has several programs targeted towards different audiences. We
have submitted a proposal to the Plant Comparative Genome Sequencing Program with a similar
outreach component with a focus on an adult audience. ELPL has a successful young adult program
introducing various subject matters to high school students. Here we will focus on problem solving
activities and hands-on experience in sequence comparison using MSU computer classrooms.
With the ELPL staff’s experience and reputation in working with the general public and their
commitment to work together with the PI, we expect our outreach effort will reach an audience of
approximately 100 teens/adults per year. The planned activities will not only provide a channel for the
public to see how our research plan contributes to science but also enhance their understanding of how
the process of science works in general, fulfilling the NSF’s goal of broad dissemination to enhance
scientific and technological understanding.
REFERENCES
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and
PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis
thaliana. Nature 408: 796-815
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig
JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald
M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology
Consortium. Nat Genet 25: 25-29
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M,
Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30: 276-280
Bateman A, Haft DH (2002) HMM-based databases in InterPro. Brief Bioinform 3: 236-245
Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W (2004) Consistent over-estimation of gene number in
complex plant genomes. Curr Opin Plant Biol 7: 732-736
Blanc G, Barakat A, Guyot R, Cooke R, Delseny M (2000) Extensive duplication and reshuffling in the
Arabidopsis genome. Plant Cell 12: 1093-1101
Blanc G, Wolfe KH (2004) Widespread paleopolyploidy in model plant species inferred from age distributions of
duplicate genes. Plant Cell 16: 1667-1678
Bowers JE, Chapman BA, Rong J, Paterson AH (2003) Unravelling angiosperm genome evolution by
phylogenetic analysis of chromosomal duplication events. Nature 422: 433-438
Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D (2005) The ProDom database of protein domain
families: more emphasis on 3D. Nucleic Acids Res 33: D212-215
Chakrabarti S, Lanczycki CJ, Panchenko AR, Przytycka TM, Thiessen PA, Bryant SH (2006) Refining
multiple sequence alignments with conserved core regions. Nucleic Acids Res 34: 2598-2606
Cruveiller S, Jabbari K, Clay O, Bernardi G (2004) Incorrectly predicted genes in rice? Gene 333: 187-188
Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan
K, Barakat A, Albert VA, Ma H, dePamphilis CW (2006) Widespread genome duplications throughout
the history of flowering plants. Genome Res 16: 738-749
Delorenzi M, Speed T (2002) An HMM model for coiled-coil domains and a comparison with PSSM-based
predictions. Bioinformatics 18: 617-625
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: Probabilistic consistency-based multiple
sequence alignment. Genome Res 15: 330-340
Doolittle RF (1995) The multiplicity of domains in proteins. Annu Rev Biochem 64: 287-314
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14: 755-763
Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16: 368-373
Eisen JA (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary
analysis. Genome Res 8: 163-167
Eisen JA, Fraser CM (2003) Phylogenomics: intersection of evolution and genomics. Science 300: 1706-1707
Eisen JA, Hanawalt PC (1999) A phylogenomic study of DNA repair genes, proteins, and processes. Mutat Res
435: 171-213
Eisen JA, Nelson KE, Paulsen IT, Heidelberg JF, Wu M, Dodson RJ, Deboy R, Gwinn ML, Nelson WC, Haft
DH, Hickey EK, Peterson JD, Durkin AS, Kolonay JL, Yang F, Holt I, Umayam LA, Mason T,
Brenner M, Shea TP, Parksey D, Nierman WC, Feldblyum TV, Hansen CL, Craven MB, Radune D,
Vamathevan J, Khouri H, White O, Gruber TM, Ketchum KA, Venter JC, Tettelin H, Bryant DA,
Fraser CM (2002) The complete genome sequence of Chlorobium tepidum TLS, a photosynthetic,
anaerobic, green-sulfur bacterium. Proc Natl Acad Sci U S A 99: 9509-9514
Gagne JM, Downes BP, Shiu SH, Durski AM, Vierstra RD (2002) The F-box subunit of the SCF E3 complex is
encoded by a diverse superfamily of genes in Arabidopsis. Proc Natl Acad Sci U S A 99: 11519-11524
Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, Mohan SS, Sharma S, Pinkert S, Nagaraju
S, Periaswamy B, Mishra G, Nandakumar K, Shen B, Deshpande N, Nayak R, Sarker M, Boeke JD,
Parmigiani G, Schultz J, Bader JS, Pandey A (2006) Analysis of the human protein interactome and
comparison with yeast, worm and fly interaction datasets. Nat Genet 38: 285-293
George RA, Heringa J (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins 48: 672-681
Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31: 371-373
Hartmann S, Lu D, Phillips J, Vision TJ (2006) Phytome: a platform for plant comparative genomics. Nucleic
Acids Res 34: D724-730
Henikoff S, Henikoff JG (1997) Embedding strategies for effective use of information from multiple sequence
alignments. Protein Sci 6: 698-705
Jabbari K, Cruveiller S, Clay O, Le Saux J, Bernardi G (2004) The new genes of rice: a closer look. Trends
Plant Sci 9: 281-285
Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence
alignment. Nucleic Acids Res 33: 511-518
Kemmer D, Huang Y, Shah SP, Lim J, Brumm J, Yuen MM, Ling J, Xu T, Wasserman WW, Ouellette BF
(2005) Ulysses - an application for the projection of molecular interactions across species. Genome Biol 6:
R106
Kim J, Shiu S-H, Thoma S, Li W-H, Patterson SE (2006) Patterns of expansion and expression divergence in the
plant polygalacturonase gene family. Genome Biol: In press
Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P (2006) SMART 5: domains in the context of genomes
and networks. Nucleic Acids Res 34: D257-260
Li W-H (1997) Molecular evolution. Sinauer Associates, Sunderland
Liu J, Rost B (2003) Domains, motifs and clusters in the protein universe. Curr Opin Chem Biol 7: 5-11
Lockton S, Gaut BS (2005) Plant conserved non-coding sequences and paralogue evolution. Trends Genet 21: 60-65
Lu C, Tej SS, Luo S, Haudenschild CD, Meyers BC, Green PJ (2005) Elucidation of the small RNA component
of the transcriptome. Science 309: 1567-1569
Maddison WP, Maddison DR (2006) Mesquite: a modular system for evolutionary analysis. Version 1.12.
http://mesquiteproject.org
Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J (2004) The SUPERFAMILY database in 2004:
additions and improvements. Nucleic Acids Res 32: D235-239
Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI,
Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Marchler GH, Mullokandov M,
Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Yamashita RA, Yin JJ, Zhang D, Bryant SH
(2005) CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 33: D192-196
Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M (2001) Identification of
potential interaction networks using sequence-based searches for conserved protein-protein interactions or
"interologs". Genome Res 11: 2120-2126
Meyers BC, Souret FF, Lu C, Green PJ (2006) Sweating the small stuff: microRNA discovery in plants. Curr
Opin Biotechnol 17: 139-146
Mudgil Y, Shiu SH, Stone SL, Salt JN, Goring DR (2004) A large complement of the predicted Arabidopsis
ARM repeat proteins are members of the U-box E3 ubiquitin ligase family. Plant Physiol 134: 59-66
National Science Board (2004) Chapter 7. Science and technology: public attitudes and understanding. In Science
and Engineering Indicators 2004, Arlington, VA
Nekrutenko A, Makova KD, Li WH (2002) The K(A)/K(S) ratio test for assessing the protein-coding potential of
genomic regions: an empirical and simulation study. Genome Res 12: 198-202
Nixon KC (1999) The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis. Cladistics 15: 407-414
Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG,
Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma
A, Sansone S, Brazma A (2005) ArrayExpress--a public repository for microarray gene expression data at
the EBI. Nucleic Acids Res 33: D553-555
Pujar A, Jaiswal P, Kellogg EA, Ilic K, Vincent L, Avraham S, Stevens P, Zapata F, Reiser L, Rhee SY, Sachs
MM, Schaeffer M, Stein L, Ware D, McCouch S (2006) Whole Plant Growth Stage Ontology for
Angiosperms and its Application in Plant Biology. Plant Physiol
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol
Biol Evol 4: 406-425
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting
Proteins: 2004 update. Nucleic Acids Res 32: D449-451
Samuel MA, Salt JN, Shiu S-H, Goring DR (2006) Multifunctional arm repeat domains in plants. Int Rev Cytol In
press
Shiu S-H, Bleecker AB (2001) Plant receptor-like kinase gene family: diversity, function, and signaling. Sci STKE
2001: RE22
Shiu S-H, Bleecker AB (2001) Receptor-like kinases from Arabidopsis form a monophyletic gene family related to
animal receptor kinases. Proc Natl Acad Sci U S A 98: 10763-10768
Shiu S-H, Bleecker AB (2003) Expansion of the receptor-like kinase/Pelle gene family and receptor-like proteins in
Arabidopsis. Plant Physiol 132: 530-543
Shiu S-H, Karlowski WM, Pan R, Tzeng YH, Mayer KF, Li WH (2004) Comparative analysis of the receptor-like
kinase family in Arabidopsis and rice. Plant Cell 16: 1220-1234
Shiu S-H, Byrnes JK, Pan R, Zhang P, Li WH (2006) Role of positive selection in the retention of duplicate genes
in mammalian genomes. Proc Natl Acad Sci U S A 103: 2232-2236
Shiu S-H, Shih MC, Li WH (2005) Transcription factor families have much higher expansion rates in plants than in
animals. Plant Physiol 139: 18-26
Sjolander K (2004) Phylogenomic inference of protein molecular function: advances and challenges.
Bioinformatics 20: 170-179
Smyth GK (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray
experiments. Stat Appl Genet Mol Biol 3: Article 3
Van Dongen SM (2000) Graph clustering by flow simulation. PhD thesis. University of Utrecht
Vandepoele K, Van de Peer Y (2005) Exploring the plant transcriptome through phylogenetic profiling. Plant
Physiol 137: 31-42
Vision TJ, Brown DG, Tanksley SD (2000) The origins of genomic duplications in Arabidopsis. Science 290:
2114-2117
Wootton JC, Federhen S (1996) Analysis of compositionally biased regions in sequence databases. Methods
Enzymol 266: 554-571
Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci
13: 555-556
Zdobnov EM, Apweiler R (2001) InterProScan--an integration platform for the signature-recognition methods in
InterPro. Bioinformatics 17: 847-848
Zhou H, Zhou Y (2005) SPEM: improving multiple sequence alignment with sequence profiles and predicted
secondary structures. Bioinformatics 21: 3615-3621