Detecting Gene Duplications in the Human Lineage Yuval Itan , Kevin Bryson

advertisement
doi: 10.1111/j.1469-1809.2010.00609.x
Detecting Gene Duplications in the Human Lineage
Yuval Itan1,3∗ , Kevin Bryson2,3 and Mark G. Thomas1,3,4,5
1
Research Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, UK
2
Department of Computer Science, University College London, London WC1E 6BT, UK
3
CoMPLEX (Centre for Mathematics & Physics in the Life Sciences and Experimental Biology), University College London,
London WC1E 6BT, UK
4
AHRC (Centre for the Evolution of Cultural Diversity), Institute of Archaeology, University College London,
London WC1H 0PY, UK
5
Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala, University, Norbyvagen 18D, Uppsala, SE-752 36, Sweden
Summary
Gene duplications represent an important class of evolutionary events that is likely to have contributed to the unique human
phenotype in the short evolutionary time since the human-chimpanzee divergence. With the availability of both human
and chimpanzee genome drafts in high coverage re-sequencing assemblies and the high annotation quality of most human
genes, it should now be possible to identify all human lineage-specific gene duplication events (human inparalogues)
and a few pioneering studies have attempted to do that. However, the different levels of coverage in the human and
chimpanzee’s genomes assemblies, and the differing levels of gene annotation, have led to problematic assumptions and
oversimplifications in the algorithms and the datasets used to detect human lineage-specific gene duplications. In this
study, we have developed a set of bioinformatic tools to overcome a number of the conceptual problems that are prevalent
in previous studies and have collected a reliable and representative set of human inparalogues.
Keywords: Human inparalogues, gene duplication, bioinformatics method, human evolution
Introduction
The divergence of the human and chimpanzee lineages has
been estimated to have occurred between 4.30 and 7.02 million years ago (mya) (Stauffer et al., 2001; Kumar et al., 2005;
Steiper & Young, 2006; Patterson et al., 2006). In this relatively short evolutionary time, humans have acquired a range
of unique phenotypic features; notably obligate bipedalism
(Hunt, 1994), tripling of brain size (Nieuwenhuys et al.,
2007), and complex culture, including language (Enard et al.,
2002). Gene duplication is a class of genomic event that
is likely to have contributed to the evolution of modern
human phenotype. A seminal work on the fate of duplicated genes proposes that in most cases one copy maintains
the original functionality of the gene, while the other copy
“escapes” the constraint of purifying selection, and thus be∗
Corresponding author: Yuval Itan, Research Department of
Genetics, Evolution and Environment, University College London,
London WC1E 6BT, United Kingdom. Tel: +44-207-679-5034;
Fax: +44-207-679-5052; E-mail: y.itan@ucl.ac.uk
C
comes “free” to accumulate mutations that might give rise to
novel functionalities (neo-functionalisation) or loss of function (nonfunctionalisation) (Ohno, 1970). Later experiments
on duplicated gene expression levels have shown that extant
gene pairs might partition between them the functions of the
single ancestral gene (Prince & Pickett, 2002). The subfunctionalisation model (also called the duplication-degenerationcomplementation model) proposes that the two gene copies
acquire complementary loss of function, and together they
produce the full functionality of the ancestral gene (Force
et al., 1999).
The first step in identifying gene duplications in a specific
species and for a particular species pair is the identification
of corresponding orthologues in the reference species. This
allows a distinction to be made between “out-” and “in-” paralogues; duplications that happened before or after speciation,
respectively. However, until recently most studies focused on
paralogues, without the distinction between inparalogues and
outparalogues (Remm et al., 2001), while some studies combined paralogy with segmental duplications research (Bailey
& Eichler, 2006).
2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Annals of Human Genetics (2010) 74,555–565
555
Y. Itan et al.
The main bioinformatics tool available to identify species’
inparalogues is InParanoid (In-paralogue and Orthologue
Identification) (Remm et al., 2001; O’Brien et al., 2005),
an algorithm and a database that differentiates various species’
outparalogues from inparalogues. InParanoid is the only comprehensive bioinformatics tool that is designed specifically
to detect inparalogues (rather than paralogues in general),
which makes it a potentially ideal tool for detecting human
inparalogues. The algorithm identifies orthologues and inparalogues between any given pair of genomes (two species
comparisons only), while the software MultiParanoid detects
orthology and paralogy among multiple species, making it
conceptually more similar to COG (Clusters of Orthologues
Groups), the first platform created to identify large-scale clusters/groups of orthologues and paralogues (Berglund et al.,
2008). This is distinct from previous methods that identified smaller and separate sets of orthologues and paralogues
(Tatusov et al., 1997). Given the proteomes (in this case – exactly one protein from each coding gene for a nonredundant
dataset) of two given species, B and C, with a most recent
common ancestor, A, the InParanoid algorithm (Remm et al.,
2001; O’Brien et al., 2005) proceeds as follows: (1) find all
sequence pairwise similarities between B-C, C-B, B-B, and
C-C using the amino acid sequence similarity search tool
BLASTP (Altschul et al., 1990); (2) mark two-way best hits
as potential orthologues (these are termed seed-orthologues);
(3) add potential inparalogues for each seed-orthologues pair,
by assuming that two inparalogues (which are, by definition, from the same species) are closer to each other than
the distance between the seed-orthologues, otherwise the
gene duplication is assumed to be before the divergence
of B and C, and thus the two sequences are considered to
be outparalogues; (4) calculate confidence value scores for
the potential inparalogues using the averaged BLASTP bit
scores; and (5) resolve overlapping groups of orthologues and
inparalogues.
We have identified several problems associated with using InParanoid’s database of human-chimpanzee orthologues/inparalogues (O’Brien et al., 2005), and with attempting to use InParanoid locally with the human and chimpanzee
proteomes (Hubbard et al., 2009). In the Results section, we
detail the problems that we have encountered in the different
categories of the genomes’ annotations and the InParanoid
database. In summary, the use of a nonmodel organism’s proteome (while for the purpose of this study we define “model
organism” as species with an experimentally validated annotation) as one of the species when performing inparalogue
prediction using InParanoid, or any other published inparalogue prediction algorithm, underestimates the nonmodel
organism’s inparalogue count. This, and the other problems
that we have presented above, present major obstacles in detecting human lineage-specific gene duplications, and have
556
Annals of Human Genetics (2010) 74,555–565
necessitated the development of a new algorithm. In this
work, we describe an algorithm to detect lineage-specific duplications in any case where one genome is that of a model
organism, while the other is nonmodel (Fig. 1). The algorithm first filters the data by removing ambiguous sequences
and resolving gene overlaps, then selects human inparalogue
candidates using InParanoid and the mouse proteome. The
inparalogue candidates are used to identify potential chimpanzee orthologues and human inparalogues on chimpanzee
and human genomes, respectively. The full duplication lengths
of these candidates are identified, and phylogenetic trees inferred, while removing topologies that indicate human outparalogues and filtering for molecular clock violations. The
final step of the algorithm is the removal of gene conversion
(GC) candidates to provide a human inparalogue candidate
set. The algorithms are implemented in scripts that are available from the authors on request. We perform this analysis and
report a set of human-specific gene duplications that have occurred since the human-chimpanzee split.
Materials and Methods
The algorithm we present (Fig. 1) requires the availability of genome assemblies for both species of interest and
the proteome of an outgroup model organism. The algorithm was implemented using the programming language Perl
(http://www.perl.org/). Other applications that were used will
be described in the relevant sections.
Choosing an Outgroup and Filtering Data
The first part of the algorithm is to identify potential inparalogues between the two model organisms using proteome data
and InParanoid software. The first model organism is one that
is the target species – in this case human – and the second
should be the phylogenetically most closely related model organism with an available annotated genome. In this case, mouse
is the most closely related model organism to human (Waterston
et al., 2002; Benton & Donoghue, 2007). Using the biological
data mining website BioMart (Smedley et al., 2009), the following Ensembl annotation features for all 21,388 human and
23,019 mouse protein-coding genes were obtained: (1) chromosome number/symbol; (2) start location; (3) end location. Genes
were removed if the chromosome’s symbol was either ambiguous or indicated haplotype data. Also genes were removed if they
were overlapping, following the logical rule:
IF(chri = = chri−1 AND strandi = = strandi−1 AND starti
< endi−1 ) THEN genei filtered.
This rule ensures that if two genes are located on the same
chromosome and the start location of one gene is located within
the other gene then it is classed as overlapping and thus removed.
C 2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Detecting Gene Duplications in Humans
This step was repeated until there were zero overlaps. From each
human and mouse nonambiguous and nonoverlapping known
coding gene, the longest peptide sequence was obtained. Altogether, the final set of acquired peptides representing the human
and mouse proteomes consisted of 18,522 human and 21,043
mouse amino acid sequences; the above filtering process removed
2866 human peptides and 1976 mouse peptides.
Human-Mouse InParanoid Analysis
As mentioned above, InParanoid (Remm et al., 2001; O’Brien
et al., 2005) will not collect a reliable list of human inparalogues
with the human and the projected chimpanzee proteomes. However, InParanoid provides a very robust and accurate platform
for detecting inparalogues among two model organisms such
as human and mouse (van Noort et al., 2003). Performing an
InParanoid analysis with the 18,522 human peptide sequences
and 21,043 mouse peptide sequences resulted in 16,227 clusters of human-mouse seed orthologues, 305 containing one or
more human inparalogues (occurring since the human-mouse
divergence around 61.5 mya (Benton & Donoghue, 2007). The
human-chimpanzee human inparalogue set should nest within
this set of 305 genes.
Human-Chimpanzee BLAT Analysis
Figure 1 The filtering and analyses
stages in the human lineage gene
duplication detecting algorithm. Each
stage provides the input for the next
stage, while the initial input is the full
human and mouse proteomes. H1 and
H2 represent the human orthologue
and human inparalogue candidate
sequences, respectively.
C
BLAT (BLAST Like Alignment Tool) is a software tool that
identifies highly similar DNA or peptide sequences in a database,
such as a full genome assembly (Kent, 2002). To detect gene
duplications, the cDNA sequence was used to search the human and chimpanzee genomes (the chimpanzee’s assembly has
a coverage of ×6, a reasonably high level for a nonmodel organism). For each of the 305 human-mouse clusters containing
one or more human inparalogues, the cDNA sequence of the
human seed-orthologue peptide was acquired using BioMart
(Smedley et al., 2009), then BLAT was used to identify the
chimpanzee orthologues and human paralogues. BLAT’s characteristics are tailored to identify DNA sequence duplications
on genomes with a high degree of similarity. This makes BLAT
suited for comparisons of closely related species such as human
and chimpanzee, and consequently suited for finding human inparalogues, which are assumed to have a smaller distance from
their human paralogue than the distance between the human
and chimpanzee orthologues. The BLAT run of the 305 human
cDNA sequences against the human and chimpanzee genomes
on the UCSC web server was automated using a Perl script
that is available at: http://genomewiki.ucsc.edu/index.php/
Image:BlatBot_pl.txt (importantly, Ensembl and UCSC have, at
the time of writing, similar genome assemblies for human and
chimpanzee, so there was no discrepancy between the use of both
Ensembl and UCSC systems in this work). The chimpanzee orthologues and human inparalogue candidates were collected from
all BLAT hits by applying the following criteria:
2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Annals of Human Genetics (2010) 74,555–565
557
Y. Itan et al.
(1) Highest bit scores (which the BLAT algorithm uses to determine the best match). For human inparalogues detection, a
minimum threshold of half of that of the best hit was applied
(i.e., a hit was considered as a paralogue if its score was more
than half of the best hit).
(2) Sequence length similarity of at least 50%, since local alignment may capture various regions of the BLATed cDNA
sequence scattered on huge regions of the chromosome.
The human inparalogue candidates were then filtered for overlaps to avoid redundancies in the BLAT hits.
Finding the Full Extent of Human
Duplicated Regions
The human cDNA sequences collected using the above protocol will usually represent only a portion of the actual segmental sequence duplicated. To find the full extent of each duplicated region, the Ensembl Perl API interface (http://www.
ensembl.org/info/data/api.html) was employed. The dataset was
divided into triplets of (1) human orthologue, (2) human inparalogue candidate, and (3) chimpanzee orthologue. Upstream
from the start of each of the three sequences, sliding windows of 100 base pair slices were obtained and compared to
each other. In case, there was a similarity greater than 90%
(a heuristic value, greater than the similarity between two random sequences and lower than the expected 95%–100% human inparalogues/human-chimpanzee orthologues comparisons
(Britten, 2002; Mikkelsen et al., 2005) another 100 base pair
slice upstream of the previous slice was obtained and the same
similarity check was made. The window continued its upstream
slide until similarity dropped below 90%. The same process was
performed downstream of the end of each human and chimpanzee gene sequence. Importantly, as genomes are represented
by only one strand, whenever a sequence was on the opposite strand the complementary sequence was inferred and the
upstream–downstream directions were reversed. By checking for
nonoverlap between the extended tandem duplications, the full
human duplications and their full length chimpanzee orthologue
DNA sequences were obtained.
Alignment, Phylogenetic Trees, and Molecular
Clock Testing
At this stage, the human-human-chimpanzee orthologue set contains inparalogue candidates, as they were identified as potential
human-chimpanzee inparalogues only by comparison of human
and mouse proteomes. The human-mouse divergence has been
estimated to have occurred some ∼61.5 mya, while the humanchimpanzee split occurred (according to a recent estimate) some
∼6.6 mya (Steiper & Young, 2006), so the majority of duplications identified at this stage are expected to be outparalogues
with respect to human-chimpanzee divergence. Phylogenetic
tree inference offers a means of categorising homology types,
although it is very computationally demanding to infer phyloge-
558
Annals of Human Genetics (2010) 74,555–565
nies from full genome data (Koonin, 2005; Yang, 2006; Altenhoff
& Dessimoz, 2009). However, the InParanoid clustering and various filtering procedures described above reduce the number of
potential inparalogue candidate sets to a few hundred humanhuman-chimpanzee gene triplets and this has made possible the
differentiation of human inparalogues from outparalogues using a
phylogenetic approach. The first step in any phylogenetic inference is multiple sequence alignment. The software we chose to
do this was MAFFT – Multiple sequence Alignment employing
Fast Fourier Transform (Katoh et al., 2002). The advantage of this
method is its flexibility and reliability tradeoff as it automatically
optimises the alignment according to the lengths of the different
DNA sequences used. This is particularly useful when alignments need to be performed on sequences from a few hundred
to hundreds of thousands of base pairs in length. A benchmark
test (Katoh et al., 2005) has shown high performance of MAFFT
when compared to other well-established methods, including
MUSCLE (Edgar, 2004), T-Coffee (Notredame et al., 2000),
and ClustalW (Thompson et al., 1994) (note that the benchmark was, as with most alignment benchmarks, testing protein
alignments. In manual testing that we performed MAFFT also
showed favourable performance for DNA sequences). We automated the application of the DNAML and DNAMLK maximum
likelihood phylogeny inference programs, which are a part of the
Phylip package (Felsenstein, 1989), with the aligned sequences as
input. DNAMLK assumes a molecular clock, whereas DNAML
does not. Both tree topology and maximum likelihood score were
obtained from each set of sequences. The first filtering process
kept only the trees with the clock tree topology ([H1,H2],C1)
– representing two human inparalogues and their chimpanzee
orthologue (see Fig. 2 for the three possible tree topologies).
Then a likelihood ratio test of the molecular clock was applied
(Felsenstein, 1981), as follows: L R = 2(L R1 − L R0 ), where
ML1 is the DNAMLK (clock) log maximum likelihood score and
ML0 is the DNAML (no clock) log maximum likelihood score. In
cases where 2L R > 3.84 (p-value < 0.05 for a χ 2 distribution
with 1 degree of freedom: df = s − 2 where s is the number of
sequences) the molecular clock was considered to be violated.
Gene Conversion
GC may cause true outparalogues to be categorised as inparalogues as one of its main effects is to reduce sequence
difference between adjacent paralogues. However, an important feature of gene conversion is that it occurs more frequently in sequences with high Guanine Cytosine content.
Various studies give the range of 60%–90% (Galtier et al.,
2001; Galtier, 2003; Marais, 2003; Spencer et al., 2006; Chen
et al., 2007). Combining this with the observation that gene
converted sequences are usually located a short genetic distance from each other (Chen et al., 2007), it was possible to construct, to a first order of approximation, a GC
filter. The GC detection software GENECONV (Sawyer,
1989) (http://www.math.wustl.edu/∼sawyer/geneconv) is the
most well-established computational tool for this purpose.
However, GENECONV does not differentiate between recent
C 2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Detecting Gene Duplications in Humans
Figure 2 The tree space of two human sequences and one chimpanzee sequence. The
left tree is the only one that represents two human inparalogues and their chimpanzee
orthologue. In the central and right trees, the human sequences are outparalogues.
Guanine Cytosine content of the inparalogue candidates was calculated. Pairs where the genetic distance was smaller than 0.02
on a scale of 0–0.1 and their Guanine Cytosine content was
greater than 60% were considered to be GC candidates. Figure 3
shows that all inparalogue candidates with a high Guanine Cytosine content also have a very short genetic distance from each
other and cluster into one well-defined group, which should be
0.06
0.04
0.00
0.02
Genetic distance
0.08
0.10
gene duplications and GC and does not take into account Guanine Cytosine content, so for the specific purpose of our study
we needed to develop a new GC detection tool. For each
human inparalogue candidate pair on the same chromosome,
the genetic distance between the sequences was calculated using the DNADIST program with the F84 substitution matrix
(Felsenstein, 1989). After calculating the genetic distances, the
0.35
0.40
0.45
0.50
0.55
0.60
0.65
Guanine Cytosine content
Figure 3 Guanine Cytosine content and divergence on same chromosome human
inparalogues; a set of gene conversion candidates. For each duplication event, genetic
distance was calculated between the human orthologue and its inparalogue candidate using
the F84 substitution matrix in the Phylip package (Felsenstein, 1989). Guanine Cytosine
content was calculated by counting the G and C bases in each inparalogue candidate, then
dividing by the full sequence length. The red circle shows that all duplications having high
Guanine Cytosine content also have a short genetic distance from their orthologues,
distinguishing them as likely to have undergone gene conversion.
C
2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Annals of Human Genetics (2010) 74,555–565
559
Y. Itan et al.
enriched for GC sequences. All inparalogue candidates that were
identified as GC candidates by the above criteria were removed.
It is important to note that this candidate gene converted set may
consist of false positives – genes that were assigned gene converted status but that are genuine inparalogues. However, since
only nine such sequences were removed, and since a reliable final
dataset of true inparalogues is of a greater importance, this should
not be considered as a problem.
Results
The Human Inparalogues Set
After applying the full process that we have described in this
work, 138 human inparalogues were identified, 104 of them
are duplications that occurred on the same chromosome,
while 34 are duplications found on different chromosomes.
In Table S1, we present this full set, including the orthologue’s Gene Ontology function (or other BioMart resources
when GO was not available) (Ashburner et al., 2000) and the
chromosome, start position, end position, and strand for each
duplication. See Table 3 for a summary of Table S1, including
a gene enrichment score for each functional group.
The Problems in Nonmodel Organisms’
Annotations
Chimpanzee is a nonmodel organism whose genome sequence (Mikkelsen et al., 2005) has been annotated through
projection from the human genome by Ensembl. The majority of annotated genomes available from Ensembl and BioMart
are of nonmodel organisms, such as the orangutan, macaque,
horse, cat, platypus, and more (Mikkelsen et al., 2005;
Hubbard et al., 2009). One major implication of annotating
a nonmodel organism’s genome is that only a very low proportion of genes will have been characterised experimentally.
The annotation of a nonmodel organism’s unknown genome
is carried out by projection, which involves aligning its transcripts to the known genes from the evolutionary nearest
genome(s) (Table 1), human in the case of chimpanzee. Note
that for low-coverage genomes or where genes cannot be annotated by projection from a model organism, Ensembl apply
another annotation category termed “novel genes.” Unlike
projection, this process allows the content of the original
assembly to be changed according to the model organism’s
gene sequence (see: http://www.ensembl.org/info/docs/
genebuild/genome_annotation.html). In this study, we will
term both “projected” and “novel” genes as “projected.” The
majority of annotated chimpanzee genes are projected from
known human genes. As a consequence, comparing a human
genome/proteome with its projected chimpanzee counterpart is essentially the same as comparing human genes with
“more poorly annotated” versions of themselves.
The Ensembl database (version 52) contains 21,416 human annotated coding genes – mostly “known,” and 19,829
chimpanzee coding genes – mostly “projected” and “novel”
(see also Table 1 for the number of known genes for each
species) – about 1600 more human annotated genes. The
results of this difference can be illustrated by performing InParanoid analysis using the human and chimpanzee proteomes
as input, after applying various filtering steps that are detailed
in the Materials and Methods section. The output of this
analysis is the full set of human and chimpanzee orthologue
groups (including those that contain inparalogues). We identified cases of human and chimpanzee orthologous groups
where one species has inparalogues, while the other species
has no inparalogues (in other words, human- or chimpanzeespecific inparalogue groups). This gives an indication of how
balanced the two genomes’ annotation is, under the null hypothesis that human and chimpanzee have a similar number of
species-specific gene duplications. The numbers of humanand chimpanzee-specific gene duplications show a massive
bias towards human duplications – 192 human-specific inparalogues groups, and only 33 in chimpanzee (Table 2), an
almost sixfold difference. While at first sight this may appear
Table 1 Gene categories in model and nonmodel organisms’ genomes and proteomes
560
Organism
Number known
protein-coding
genes
Number projected and
novel protein-coding
genes
Number
ambiguous
genes
Number
overlapping
genes
Number
haplotype
genes
Human (Homo sapiens)
Chimpanzee (Pan troglodytes)
Orangutan (Pongo pygmaeus abelii)
Macaque (Macaca mulatta)
Mouse (Mus musculus)
Cow (Bos taurus)
Horse (Equus caballus)
21388
2647
3813
874
23019
20471
723
28
17182
16255
21031
98
583
19599
221
1268
1245
1123
273
2745
153
2125
1226
1007
1854
1976
874
1024
741
88
-
Annals of Human Genetics (2010) 74,555–565
C 2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Detecting Gene Duplications in Humans
Table 2 The number of orthologous clusters having species-specific
inparalogues, detected by InParanoid
Organisms tested
N estimated
human-specific
duplications
N estimated
species-specific
duplications
Human-chimpanzee∗
Human-mouse
Human-orangutan∗
Human-macaque∗
Human-cow
Human-horse∗
192
207
171
208
220
196
33
326
47
111
279
204
Nonmodel organisms are identified by “∗ .” For hypothetical species
j and k, a cluster was detected for having species-specific inparalogues by counting the number of inparalogues for j and k, then
if the number of j inparalogues is greater than 0 and the number k
inparalogues is equal to 0 then the cluster is considered as having
j-specific inparalogues (and vice versa for k-specific inparalogues).
to indicate that the human lineage has had a significantly accelerated gene duplication rate compared to chimpanzee (or
alternatively that chimpanzee had a significant deceleration),
performing InParanoid analysis with the human proteome
and that of several other species indicates that this is not the
case. Rather there seems to be a bias in finding inparalogues
stemming from the nature of the specific species’ genome
annotation. When running the human proteome against organisms for which the majority of the genes are well annotated
(e.g., mouse or cow, see Tables 1 and 2) the tendency to detect
more human lineage-specific gene duplications was reversed.
We detected differences of 1.27 and 1.57 times more lineagespecific inparalogue-containing groups in cow and mouse,
respectively, than in human. The number of human peptide
sequences used is only 1.07 times larger than the chimpanzee’s,
while the number of mouse peptides is 1.1 times larger than
human and the cow’s is 0.9 times the size of the human’s, so
differences in the number of peptide sequences among different species is unlikely to account for the bias seen in the
human-chimpanzee comparison. Performing similar InParanoid runs and species-specific duplication analyses of the human proteome against other nonmodel primates revealed similar patterns to those seen with chimpanzee: a 3.64-fold more
human-specific inparalogue groups than orangutan-specific,
and 1.84-fold more human-specific inparalogue groups than
macaque-specific inparalogue groups. Conversely, performing the analyses described above for the human proteome
against horse, a nonprimate nonmodel organism, revealed
more similarity in the number of species-specific inparalogue
groups among the two species – 196 human-specific versus
204 horse specific (Table 2). However, the Ensembl genome
annotation for horse is projected from all known mammalian
genes, and – at a lower priority – from nonmammalian
C
vertebrates (http://www.ensembl.org/Equus_caballus/Info/
Index). Also, the horse genome assembly coverage in Ensembl version 52 was ×6.79 (where the assembly coverage
unit represents the average number of times that each unit of
the genome was sequenced), a relatively high level of coverage
(Ensembl unofficially defines “low coverage” as about ×2.5
or less). The similar number of human- and horse-specific inparalogue groups, together with the horse annotation process
and its high coverage show that when all available amniotes’
(i.e., mammals, reptiles, and birds) genomes are taken into
consideration then there is little numeric bias.
Identifying the problems in using Human haplotype data
As a part of the effort to map human genomic variants that
may be associated with susceptibility to common diseases, two
projects were conducted to identify haplotypes of two of the
Major Histocompatibility Complex (MHC) genes on Human chromosome 6 (COX and QBL). Susceptibility to more
than 100 diseases has been mapped to this region (Stewart
et al., 2004; Traherne et al., 2006). The Ensembl annotated
human genome database (Hubbard et al., 2009) includes 246
COX and 234 QBL haplotypes, as well as 741 protein variants (Hubbard et al., 2009; Smedley et al., 2009, Table 1).
The InParanoid database of orthologues and inparalogues
(http://inparanoid.sbc.su.se/cgi-bin/index.cgi) was collected
using the InParanoid algorithm with the full known proteome of each species. It attempts to identify inparalogues
using the longest protein sequence from each human coding
gene. However, the InParanoid database input is not filtered
for human haplotype data and so COX and QBL protein
sequences are included. The result is many variants of the
same genes collected from different genomes, which leads to
false detection of inparalogues (i.e., false positives) as haplotypes and protein variants are automatically categorised as
inparalogues, even though they are actually variants of the
same gene among different individuals. The use of haplotype data in InParanoid’s inparalogue detection procedure
has an effect similar to artificially adding hundreds of almost
identical copies of the same genes to the human genome
database.
The Problems in GC Data
Following gene duplication, adjacent paralogous are prone
to reciprocal unequal crossovers by virtue of the high degree of homology between them (Chen et al., 2007). As a
consequence of these unequal crossovers, an “acceptor” sequence is replaced, wholly or partly, by a sequence that is
copied from the “donor” sequence, whereas the sequence
of the donor remains unaltered. This process is termed GC
2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Annals of Human Genetics (2010) 74,555–565
561
Y. Itan et al.
Table 3 Summary of human lineage gene duplications
Inparalogue GO function
Number of
duplications
Cellular regulation
12
ENSP00000325681,ENSP00000340685,ENSP00000341674,
ENSP00000361400,ENSP00000348915,ENSP00000379097,
ENSP00000332124,ENSP00000386810,ENSP00000377304
2.29
Immune system
22
ENSP00000282633,ENSP00000289473,ENSP00000297496,
ENSP00000303532,ENSP00000324633,ENSP00000333329,
ENSP00000335281,ENSP00000335307,ENSP00000340685,
ENSP00000367605,ENSP00000374785,ENSP00000367605,
ENSP00000374785,ENSP00000374789,ENSP00000374792,
ENSP00000374808,ENSP00000382951
5.33
Inter/intra cellular signalling
16
ENSP00000056217,ENSP00000289473,ENSP00000328013,
ENSP00000328230,ENSP00000341674,ENSP00000342143,
ENSP00000344026,ENSP00000352498,ENSP00000358942,
ENSP00000370076,ENSP00000386810
0.17
Membrane protein
11
ENSP00000226272,ENSP00000277634,ENSP00000328443,
ENSP00000329825,ENSP00000333071
1.07
Metabolic and catabolic processes
19
ENSP00000255845,ENSP00000305847,ENSP00000333329,
ENSP00000347318,ENSP00000348915,ENSP00000350939
2.15
Nucleus activity
3
ENSP00000358518,ENSP00000368039
0.00
Sensory perception
11
ENSP00000305469,ENSP00000324687,ENSP00000329982,
ENSP00000331774,ENSP00000335529
3.02
Transcription and translation
regulation
15
ENSP00000305857,ENSP00000322697,ENSP00000334501,
ENSP00000338561,ENSP00000346045,ENSP00000346045,
ENSP00000361377,ENSP00000366381,ENSP00000375129,
ENSP00000378953,ENSP00000302756,ENSP00000375129
0.02
Unknown
10
ENSP00000329982,ENSP00000340685,ENSP00000361377,
ENSP00000374792,ENSP00000377554,ENSP00000382514
-
Noncoding duplication
19
ENSP00000236937,ENSP00000259216,ENSP00000329982,
ENSP00000333329,ENSP00000346022,ENSP00000348915,
ENSP00000352498,ENSP00000359991,ENSP00000361377,
ENSP00000374789,ENSP00000374792,ENSP00000387009
-
Orthologues Ensembl IDs1
Functional group’s
enrichment score2
1
Note that since an orthologue was often copied more than one time, there are cases where the number of orthologue IDs is smaller than the
number of duplications.
2
Gene enrichment score for functional groups was calculated using DAVID (http://david.abcc.ncifcrf.gov/). The score is the geometric mean
of the group’s p-value, so higher score means lower p-value – gene enriched groups.
(Chen et al., 2007). As a consequence of GC, the two copies
of the gene can revert to a very high degree of similarity,
even though their duplication event may predate a speciation event. In the case of a gene duplication occurring before the most recent speciation event, followed by GC, any
currently available inparalogue detection method is likely to
identify the two copies as inparalogues, when in fact they
are outparalogues. There are no bioinformatics filters currently available for GC detection that are suitable for the
task of this study, and so we expect that all currently available inparalogue detection methods will include false-positive
inparalogues.
562
Annals of Human Genetics (2010) 74,555–565
The Problems in Ambiguous Data
As a part of the Ensembl (Hubbard et al., 2009) gene
annotation process of each species, transcripts are aligned
to the whole sequenced genome to identify the chromosomal location of each gene. Due to low sequence coverage or low transcript quality, there are cases where a
transcript cannot be mapped to specific chromosomal regions, and consequently the gene’s chromosomal location
is categorised as “random” (when a specific chromosome
is identified), “Un” (when the chromosome is unknown),
or “NT” (like “Un,” but with the original contig’s name
C 2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Detecting Gene Duplications in Humans
specified as a chromosome). Here, we refer to these three
classes of ambiguous annotation data as “ambiguous.” The
numbers and distributions of these “ambiguous” genes vary
among the different annotated species. The Ensembl database
(http://www.ensembl.org/index.html) includes 221 ambiguous human genes and 1268 ambiguous chimpanzee genes
(Table 1) that were used to assemble the current InParanoid
human-chimpanzee database (O’Brien et al., 2005). Although
the sequence quality for some of these genes may be adequate,
the fact that they are “ambiguous” (as described above) makes
it difficult to detect whether they overlap with other genes.
In addition, if they are identified as gene duplications it is
difficult to know if they are tandem duplications (on a similar chromosome) or gene duplications located on different
chromosomes. Altogether, the fact that these genes cannot be
traced to a specific location suggests a problem in the quality
of the genome annotations, and so using these genes makes
the current inparalogue dataset less reliable.
Discussion
This study describes the problems encountered in detecting a
reliable set of human-chimpanzee human inparalogues when
using the currently available homology detection methods.
These problems include the use of nonmodel organisms, human haplotype data, proteome data, and GC. We go on to
describe an algorithm that we have developed to overcome
these problems and find a set of human inparalogues.
The algorithm that we have developed and the filtering
processes applied are relevant for any model/nonmodel organism inparalogue detection project. For example, in a proposed future project the algorithm can be applied to find cow
(an organism with a majority of genes known) – dog (an organism with a majority of genes projected) inparalogues, using
a rat (a model organism with a majority of known genes) as
an outgroup, where cow and dog diverged about 62.3 mya,
while cow-dog diverged from rat about 95.3 mya (Benton &
Donoghue, 2007). In this example, since cow and dog are
more distantly related than human-chimpanzee, we suggest
the use of BLAST instead of BLAT, as it is more sensitive for
more distant homologies. We see no reason why the method
we describe should not be used to identify the full inparalogue
datasets for all model/nonmodel organism pairs for which
high-coverage genome data are available. However, when applying the algorithm for detecting other species-specific gene
duplications, the issue of the genome assembly quality should
be taken into account. Other than inaccuracies in the genomic
sequence, which is the obvious consequence of a low-quality
genome assembly, there are large-scale errors in such genomes.
For example, the chimpanzee genome assembly coverage is
inferior than human’s, and as a result of the genome misassembly many false duplications appear in the chimpanzee
C
genome (Kelley & Salzberg, 2010). Thus, any attempt to discover chimpanzee-specific gene duplications would need to
account for these false duplications. This problem of the current chimpanzee assembly did not affect our results because
we used only the chimpanzee orthologues, of which there
was one copy only.
The main problem that we detected in the InParanoid
database is the abundance of false positives, as elaborated in
the Material and Methods and the Results sections. To estimate the potential of false positives and false negatives in our
method, we compared our orthologue clusters results with the
results of running InParanoid with the human and the chimpanzee proteomes (the longest peptide from each Ensembl
coding gene, as it is in the InParanoid human-chimpanzee
inparalogues database). For orthologue clusters that contain
human inparalogues the number of orthologues and inparalogues for the human-chimpanzee run is 1609, while our
results contain a total number of 185: a maximum of 1424
false positives in the InParanoid database. We estimate that
the maximum number of false negatives, true human inparalogues that were not detected in our study is 76. This
number was estimated by the number of orthologues and inparalogues that were detected in the InParanoid run and not
in our study. However, we stress that the true false-positives
number is likely to be significantly smaller as the vast number
of false positives in the human-chimpanzee run was likely to
affect the quality of clustering and inparalogue detection. We
estimate that the maximum number of false positives, false
human inparalogues that were detected in our study is 101.
This number was estimated from the number of orthologues
and inparalogues that were detected in our study run but not
in the InParanoid run. We stress again that due to the vast
number of false positives in the human-chimpanzee InParanoid run this number is likely to be significantly lower as the
main effect of our method is to filter false positives from the
results, and so our method is much more prone to false negatives than to false positives. Consequently, we believe that
the true false-positive number from our analysis is close to
zero. The number of orthologues and inparalogues that were
detected in our study run and also detected in the InParanoid
run is 84.
Some improvements that we envisage for future versions
of the algorithm include a more robust process of detecting GC, and the creation of a fully automated pipeline of the
process. Such an application would take as an input two evolutionary neighbouring species and their outgroup species proteomes, and provide as an output the full set of inparalogues
after performing the filtering, clustering, and tree inference
procedures.
To detect the full regions of the gene duplications (rather
than only the region represented by the protein sequence),
InParanoid could be adapted to use BLASTN instead of
2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Annals of Human Genetics (2010) 74,555–565
563
Y. Itan et al.
BLASTP. This would result in the input being a nonredundant
filtered genome, rather than the proteome data of the species.
However, testing this option has revealed that current conventional computing power is insufficient for such a task. A test
run of 3000 human and chimpanzee sequences (1500 from
each species), where the upper threshold for one sequence
length was 300,000 base pairs required about 4 GB RAM.
A full genome InParanoid run with about 20,000 genes from
each species and no sequence length threshold (which may include sequences of a million base pairs or more) would require
about 50 GB RAM (as an approximation, assuming that the
full nonredundant genomes will be more than 10–15 times
larger than the 3000 human and chimpanzee dataset that was
tested). Adapting InParanoid for distributed computing may
enable such a task in a feasible timeframe (i.e., in a number of
weeks or less) and amount of computer memory.
In addition to gene duplications, identifying other classes
of “all human-lineage” genomic events is also likely to be informative on the evolution of the unique human phenotype.
Such events include pseudogenization (Wang et al., 2006),
regulatory region changes (Montgomery, 2009), retroviral
insertions and sequence deletions (Costantini & Bernardi,
2009), and genomic rearrangement (Zhang et al., 2009). For
all such searches, the particulars of differences in genome annotation quality would need to be accounted for.
Acknowledgements
We thank Ziheng Yang, Neil Bradman, Richard Emes, and Jessica
Vamathevan for their very important contribution, as well as
the Ensembl (particularly Bert Overduin) and BLAT teams for
their ongoing support and patience. YI was funded by the UCL
ORS, UCL Graduate School, B’nai B’rith and the Anglo-Jewish
Association scholarships.
References
Altenhoff, A. M. & Dessimoz, C. (2009) Phylogenetic and functional
assessment of orthologs inference projects and methods. PLoS
Comput Biol 5, e1000262.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J.
(1990) Basic local alignment search tool. J Mol Biol 215, 403–410.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H.,
Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig,
J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A.,
Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin,
G. M. & Sherlock, G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25,
25–29.
Bailey, J. A. & Eichler, E. E. (2006) Primate segmental duplications:
crucibles of evolution, diversity and disease. Nat Rev Genet 7,
552–564.
Benton, M. J. & Donoghue, P. C. (2007) Paleontological evidence
to date the tree of life. Mol Biol Evol 24, 26–53.
Berglund, A. C., Sjolund, E., Ostlund, G. & Sonnhammer, E. L.
(2008) InParanoid 6: eukaryotic ortholog clusters with inparalogs.
Nucleic Acids Res 36, D263–D266.
564
Annals of Human Genetics (2010) 74,555–565
Britten, R. J. (2002) Divergence between samples of chimpanzee
and human DNA sequences is 5%, counting indels. Proc Natl Acad
Sci U S A 99, 13633–13635.
Chen, J. M., Cooper, D. N., Chuzhanova, N., Ferec, C. & Patrinos,
G. P. (2007) Gene conversion: mechanisms, evolution and human
disease. Nat Rev Genet 8, 762–775.
Costantini, M. & Bernardi, G. (2009) Mapping insertions, deletions
and SNPs on Venter’s chromosomes. PLoS One 4, e5972.
Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with
high accuracy and high throughput. Nucleic Acids Res 32, 1792–
1797.
Enard, W., Przeworski, M., Fisher, S. E., Lai, C. S., Wiebe, V.,
Kitano, T., Monaco, A. P. & Paabo, S. (2002) Molecular evolution
of FOXP2, a gene involved in speech and language. Nature 418,
869–872.
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a
maximum likelihood approach. J Mol Evol 17, 368–376.
Felsenstein, J. (1989) PHYLIP – Phylogeny inference package
(Version 3.2). Cladistics 164–166.
Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y. L.
& Postlethwait, J. (1999) Preservation of duplicate genes by
complementary, degenerative mutations. Genetics 151, 1531–
145.
Galtier, N. (2003) Gene conversion drives GC content evolution in
mammalian histones. Trends Genet 19, 65–68.
Galtier, N., Piganeau, G., Mouchiroud, D. & Duret, L. (2001) GCcontent evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics 159, 907–911.
Hubbard, T. J., Aken, B. L., Ayling, S., Ballester, B., Beal, K.,
Bragin, E., Brent, S., Chen, Y., Clapham, P., Clarke, L., Coates,
G., Fairley, S., Fitzgerald, S., Fernandez-Banet, J., Gordon, L.,
Graf, S., Haider, S., Hammond, M., Holland, R., Howe, K.,
Jenkinson, A., Johnson, N., Kahari, A., Keefe, D., Keenan, S.,
Kinsella, R., Kokocinski, F., Kulesha, E., Lawson, D., Longden,
I., Megy, K., Meidl, P., Overduin, B., Parker, A., Pritchard, B.,
Rios, D., Schuster, M., Slater, G., Smedley, D., Spooner, W.,
Spudich, G., Trevanion, S., Vilella, A., Vogel, J., White, S., Wilder,
S., Zadissa, A., Birney, E., Cunningham, F., Curwen, V., Durbin,
R., Fernandez-Suarez, X. M., Herrero, J., Kasprzyk, A., Proctor,
G., Smith, J., Searle, S. & Flicek, P. (2009) Ensembl 2009. Nucleic
Acids Res 37, D690–D697.
Hunt, K. D. (1994) The evolution of human bipedality: ecology and
functional morphology. J Hum Evol 26, 183–202.
Katoh, K., Kuma, K., Toh, H. & Miyata, T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment.
Nucleic Acids Res 33, 511–518.
Katoh, K., Misawa, K., Kuma, K. & Miyata, T. (2002) MAFFT: a
novel method for rapid multiple sequence alignment based on fast
Fourier transform. Nucleic Acids Res 30, 3059–3066.
Kelley, D. R. & Salzberg, S. L. (2010) Detection and correction
of false segmental duplications caused by genome mis-assembly.
Genome Biol 11, R28.
Kent, W. J. (2002) BLAT–the BLAST-like alignment tool. Genome
Res 12, 656–664.
Koonin, E. V. (2005) Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 39, 309–338.
Kumar, S., Filipski, A., Swarna, V., Walker, A. & Hedges, S. B.
(2005) Placing confidence limits on the molecular age of the
human-chimpanzee divergence. Proc Natl Acad Sci U S A 102,
18842–18847.
Marais, G. (2003) Biased gene conversion: implications for genome
and sex evolution. Trends Genet 19, 330–338.
C 2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Detecting Gene Duplications in Humans
Mikkelsen, T. S., Hillier, L. W., Eichler, E. E., Zody, M. C., Jaffe,
D. B., Yang, S. -P., Enard, W., Hellmann, I., Lindblad-Toh, K.,
Altheide, T. K., Archidiacono, N., Bork, P., Butler, J.,Chang, J.
L., Cheng, Z., Chinwalla, A. T., Dejong, P., Delehaunty, K. D.,
Fronick, C. C., Fulton, L. L., Gilad, Y., Glusman, G., Gnerre,
S.,Graves, T. A., Hayakawa, T., Hayden, K. E., Huang, X., Ji,
H., Kent, W. J., King, M. -C., Kulbokasiii, E. J., Lee, M. K.,
Liu, G., Lopez-Otin, C., Makova, K. D., Man, O., Mardis, E.
R., Mauceli, E., Miner, T. L., Nash, W. E., Nelson, J. O., Paabo,
S., Patterson, N. J., Pohl, C. S., Pollard, K. S.,Prufer, K., Puente,
X. S., Reich, D., Rocchi, M., Rosenbloom, K., Ruvolo, M.,
Richter, D. J., Schaffner, S. F., Smit, A. F. A., Smith, S. M.,
Suyama, M.,Taylor, J., Torrents, D., Tuzun, E., Varki, A., Velasco,
G., Ventura, M., Wallis, J. W., Wendl, M. C., Wilson, R. K.,
Lander, E. S. & Waterston, R. H. (2005) Initial sequence of the
chimpanzee genome and comparison with the human genome.
Nature 437, 69–87.
Montgomery, S. (2009) Current computational methods for prioritizing candidate regulatory polymorphisms. Methods Mol Biol 569,
89–114.
Nieuwenhuys, R., Voogd, J. & Van Huijzen, C. (2007) The human
central nervous system: a synopsis and atlas. Berlin: Springer-Verlag.
Notredame, C., Higgins, D. G. & Heringa, J. (2000) T-Coffee: a
novel method for fast and accurate multiple sequence alignment.
J Mol Biol 302, 205–217.
O’Brien, K. P., Remm, M. & Sonnhammer, E. L. (2005) Inparanoid:
a comprehensive database of eukaryotic orthologs. Nucleic Acids
Res 33, D476–D480.
Ohno, S. (1970) Evolution by gene duplication. Berlin: Springer-Verlag.
Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S. & Reich, D.
(2006) Genetic evidence for complex speciation of humans and
chimpanzees. Nature 441, 1103–1108.
Prince, V. E. & Pickett, F. B. (2002) Splitting pairs: the diverging
fates of duplicated genes. Nat Rev Genet 3, 827–837.
Remm, M., Storm, C. E. & Sonnhammer, E. L. (2001) Automatic
clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314, 1041–1052.
Sawyer, S. (1989) Statistical tests for detecting gene conversion. Mol
Biol Evol 6, 526–538.
Smedley, D., Haider, S., Ballester, B., Holland, R., London, D.,
Thorisson, G. & Kasprzyk, A. (2009) BioMart – biological queries
made easy. BMC Genomics 10, 22.
Spencer, C. C., Deloukas, P., Hunt, S., Mullikin, J., Myers, S.,
Silverman, B., Donnelly, P., Bentley, D. & Mcvean, G. (2006)
The influence of recombination on human genetic diversity. PLoS
Genet 2, e148.
Stauffer, R. L., Walker, A., Ryder, O. A., Lyons-Weiler, M. &
Hedges, S. B. (2001) Human and ape molecular clocks and constraints on paleontological hypotheses. J Hered 92, 469–474.
Steiper, M. E. & Young, N. M. (2006) Primate molecular divergence
dates. Mol Phylogenet Evol 41, 384–394.
Stewart, C. A., Horton, R., Allcock, R. J., Ashurst, J. L., Atrazhev,
A. M., Coggill, P., Dunham, I., Forbes, S., Halls, K., Howson,
J. M., Humphray, S. J., Hunt, S., Mungall, A. J., Osoegawa,
K., Palmer, S.,Roberts, A. N., Rogers, J., Sims, S., Wang, Y.,
Wilming, L. G., Elliott, J. F., De Jong, P. J., Sawcer, S., Todd, J.
A., Trowsdale, J. & Beck, S. (2004) Complete MHC haplotype
sequencing for common disease gene mapping. Genome Res 14,
1176–1187.
Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997) A genomic
perspective on protein families. Science 278, 631–637.
Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) CLUSTAL
C
W: improving the sensitivity of progressive multiple sequence
alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res 22, 4673–
4680.
Traherne, J. A., Horton, R., Roberts, A. N., Miretti, M. M., Hurles,
M. E., Stewart, C. A., Ashurst, J. L., Atrazhev, A. M., Coggill,
P., Palmer, S., Almeida, J., Sims, S., Wilming, L. G., Rogers, J.,
De Jong, P. J., Carrington, M., Elliott, J. F., Sawcer, S., Todd, J.
A., Trowsdale, J. & Beck, S. (2006) Genetic analysis of completely
sequenced disease-associated MHC haplotypes identifies shuffling
of segments in recent human history. PLoS Genet 2, e9.
Van Noort, V., Snel, B. & Huynen, M.A. (2003) Predicting gene
function by conserved co-expression. Trends Genet 19, 238–242.
Wang, X., Grus, W. E. & Zhang, J. (2006) Gene losses during human
origins. PLoS Biol 4, e52.
Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril,
J. F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson,
M., An, P., Antonarakis, S. E., Attwood, J., Baertsch, R.,Bailey,
J., Barlow, K., Beck, S., Berry, E., Birren, B., Bloom, T., Bork,
P., Botcherby, M., Bray, N., Brent, M. R., Brown, D. G., Brown,
S. D., Bult, C., Burton, J., Butler, J., Campbell, R. D., Carninci,
P., Cawley, S.,Chiaromonte, F., Chinwalla, A. T., Church, D.
M., Clamp, M., Clee, C., Collins, F. S., Cook, L. L., Copley,
R. R., Coulson, A., Couronne, O., Cuff, J., Curwen, V., Cutts,
T., Daly, M., David, R., Davies, J., Delehaunty, K. D., Deri, J.,
Dermitzakis, E. T., Dewey, C., Dickens, N. J., Diekhans, M.,
Dodge, S., Dubchak, I., Dunn, D. M., Eddy, S. R., Elnitski, L.,
Emes, R. D., Eswara, P., Eyras, E., Felsenfeld, A., Fewell, G. A.,
Flicek, P., Foley, K., Frankel, W. N., Fulton, L. A., Fulton, R.
S., Furey, T. S., Gage, D., Gibbs, R. A., Glusman, G., Gnerre,
S., Goldman, N., Goodstadt, L., Grafham, D., Graves, T. A.,
Green, E. D., Gregory, S., Guigo, R., Guyer, M., Hardison, R.
C., Haussler, D., Hayashizaki, Y., Hillier, L. W., Hinrichs, A.,
Hlavina, W., Holzer, T., Hsu, F., Hua, A., Hubbard, T., Hunt, A.,
Jackson, I.,Jaffe, D. B., Johnson, L. S., Jones, M., Jones, T. A., Joy,
A., Kamal, M., Karlsson, E. K., et al. (2002) Initial sequencing and
comparative analysis of the mouse genome. Nature 420, 520–562.
Yang, Z. (2006) Computational molecular evolution. Oxford: Oxford
University Press.
Zhang, F., Carvalho, C. M. & Lupski, J. R. (2009) Complex human
chromosomal and genomic rearrangements. Trends Genet 25, 298–
307.
Supporting Information
Additional supporting information may be found in the online
version of this article:
Table S1. Human lineage gene duplications.
As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such
materials are peer-reviewed and may be re-organised for
online delivery, but are not copy-edited or typeset. Technical support issues arising from supporting information
(other than missing files) should be addressed to the authors.
Received: 7 January 2010
Accepted: 5 August 2010
2010 The Authors
C 2010 Blackwell Publishing Ltd/University College London
Annals of Human Genetics Annals of Human Genetics (2010) 74,555–565
565
Download