Supplementary methods:
Choice of genes on the array. The genes on the multi-species cDNA array were chosen
from lists of genes with Refseq identifiers that were previously shown to be expressed in
primate livers, according to GeneCards database (http://www.genecards.org/) or Enard et
al. (2002). Genes were not chosen based on their function or known association to human
disease. However, as detailed in Gilad et al. (2005), we tried to ensure that the amplified
cDNA probes would include no more than a single >100-bp segment with a matching
sequence elsewhere in the human genome at an identity cutoff of 85%. This procedure
excluded genes that are located within recent segmental duplications in the human
genome and are likely to differ in their copy number between different species. Indeed, of
757 genes on our array whose physical location in the human genome can be obtained
directly by using their Refseq identifier, only one (PARG) is located within a recent
(>98%) segmental duplication, based on the Segmental Duplication Database
(http://humanparalogy.gs.washington.edu/).
We focus on liver because, in addition to the obvious cognitive and linguistic
differences between humans and non-human apes, humans are the only primate to
regularly consume cooked food, with the earliest unequivocal evidence for controlled use
of fire dating to ~400,000 years ago (Jones, 1992). The digestion of cooked food, among
other shifts in nutrition, has led to a human diet that differs sharply from that of our close
relatives (Wrangham, 1999). Such changes are likely to have been accompanied by molecular
adaptations (e.g., Neel, et al. 1998), notably in the liver.
Since the genes were chosen based on their expression in humans, a concern is
that our sample may be biased towards genes that are highly expressed in humans
compared with the other species. In order to avoid this bias, we rotated the leading
species in the first PCR amplification, i.e., the species on which each set of primers was
first tested. Specifically, approximately 25% of the primers were tested on each species.
If a primer pair successfully amplified a unique PCR product of the expected size in the
leading species, we obtained the product for this gene from the other three species as
well. If not, this primer set was excluded. The rationale behind this approach is that the
first test of the primers is more likely to yield successful amplification if the gene is
highly expressed in the species from which cDNA is used as template.
Overall, we tested approximately 1400 primer pairs on a leading species, 1056 of
which resulted in successful amplification. Of these, successful amplifications were
obtained in all species for 907 genes as detailed in Gilad et al. (2005). A recent survey of
gene expression in human and chimpanzee liver detected 9390 expressed genes
(Khaitovich et al. 2005); our array therefore includes probes for roughly 10% of genes
that are expressed in the liver. Supplementary table 1 contains the Refseq identifiers of
these 907 genes along with their multiple-tissue expression patterns based on the Novartis
Gene Expression Atlas (http://expression.gnf.org/FAQ.html#abscall), when available.
Samples and hybridizations. We extracted RNA from the liver of five adult males from
each of the four species. One advantage of working with liver is that it is one of
the most homogeneous tissues with respect to cellular composition (Balashova et al.
1984). This is in contrast to brain tissue, for example, which may differ substantially in
cellular composition between samples and, in particular, between species (e.g.,
Brodal et al. 1983). For humans, healthy tissue samples were obtained from adult male
liver resections performed at Yale Hospital (in accordance with Yale University HIC
regulations). Non-human primate samples were collected from adult male chimpanzees,
orangutans and rhesus macaques that died of natural causes or were euthanized following
a liver-unrelated disease. Sample preparation, hybridizations, and washes were performed
on our multi-primate cDNA array platform, as previously described (Gilad et al. 2005).
Analysis. In primate inter-species gene expression comparisons, we are unable to stage
the tissues or minimize environmental differences among samples. Consequently, we
chose a reference design and laboratory procedures that are aimed at minimizing the
technical variance. Specifically, we used five individuals from each of the four species
and four technical replicates of each comparison (for a total of 80 hybridizations). We
performed all four hybridizations with the same RNA sample on the same day, further
minimizing the technical variance. Preparation of the reference RNA for the entire
experiment was done in advance.
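As a rough illustration of the layout described above, the following sketch enumerates the hybridizations. The species labels, field names and the `reference` label are assumptions made for this example only; the per-gene count of 320 uses the four probes per gene described in the design below.

```python
from itertools import product

SPECIES = ["human", "chimpanzee", "orangutan", "rhesus"]  # target species
N_INDIVIDUALS = 5   # individuals per species
N_REPLICATES = 4    # technical replicates per individual
N_PROBES = 4        # one probe per species for each gene


def hybridizations():
    """One record per array: a target individual against the common reference."""
    return [
        {"target": t, "individual": i, "replicate": j, "reference": "common_reference"}
        for t, i, j in product(
            SPECIES, range(1, N_INDIVIDUALS + 1), range(1, N_REPLICATES + 1)
        )
    ]


arrays = hybridizations()
measurements_per_gene = len(arrays) * N_PROBES
print(len(arrays), measurements_per_gene)  # 80 320
```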
The common reference design facilitates an analysis in which we can estimate
components of variation for species, individuals within species, arrays and so on. The
replicate arrays for each individual give us precise estimates of the expression for each
individual and hence increase our power to detect species and other differences at the
relatively small cost of confounding individual measurements with the technical variance
associated with RNA preparation and varying days. While previous studies performed
permutation tests to assess whether interspecies divergence was greater than expected
from within species differences, our approach presents the advantage of estimating
variation in expression levels both within and between species. In this respect, our
approach resembles the HKA test (Hudson, et al. 1987), which compares nucleotide
polymorphism and divergence levels across multiple loci.
The microarray experiment was conducted as a multi-level design where nested
within each of the four target species $t = h, c, o, r$ we have 5 individuals labelled $i = 1, \dots, 5$,
within each individual we have 4 technical replicates labelled $j = 1, \dots, 4$, and within each
array, for each gene, we have 4 probes $p = h, c, o, r$. For a given gene, let $I_{tp}$ be the observed
fluorescence intensity in the red or green channel that corresponds to target species $t$ on
probe species $p$. We define
$$E[\log_2 I_{tp}] = \mu_{tp} \qquad (0.1)$$
where $E$ represents expectation. The attenuation caused by sequence mismatch
occurs when the target species and the probe species are different, and we assume that this
factor attenuates the intensity by a given amount ($k_{tp}$) for each gene. Therefore
$$I_{tp} = k_{tp} I_{tt}.$$
Taking expectations of the log, we get
$$\mu_{tp} = \alpha_{tp} + \mu_{tt}, \qquad (0.2)$$
where $\alpha_{tp} = \log_2 k_{tp}$. Each spot on the array measures the differential expression between
the target species $t$ and the human reference $h$, giving a log-ratio value $M_{tp} = \log_2(R_{tp}/G_{hp})$,
where $R_{tp}$ and $G_{hp}$ are the red (Cy5) and green (Cy3) intensities. The log expression ratio
for each spot will be a linear combination of the difference in RNA expression between
the two species, the attenuation in expression caused by sequence mismatch, and a
possibly intensity-dependent dye bias term:
$$E(M_{tp}) = (\mu_{tt} + \alpha_{tp} + r) - (\mu_{hh} + \alpha_{hp} + g) = \delta_t + (\alpha_{tp} - \alpha_{hp}) + (r - g) \qquad (0.3)$$
where $\delta_t = \mu_{tt} - \mu_{hh}$ is the true log fold change in expression for species $t$ relative to
human, $\alpha_{tp} - \alpha_{hp}$ is a difference of log attenuation factors, and $r - g$ is the
intensity-dependent dye bias.
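To make the decomposition in equation 0.3 concrete, here is a small numerical sketch. The names `mu` (expected log2 intensity on a matched probe) and `alpha` (log2 attenuation factor) are labels chosen for this example, and every number is hypothetical; the symmetric attenuation between rhesus and human is an assumption of the example, not a measured value.

```python
def expected_log_ratio(mu_tt, mu_hh, alpha_tp, alpha_hp, r=0.0, g=0.0):
    """Expected log ratio: fold change + attenuation difference + dye bias."""
    delta_t = mu_tt - mu_hh  # true log2 fold change of the target relative to human
    return delta_t + (alpha_tp - alpha_hp) + (r - g)


# Hypothetical values for one gene, rhesus target (t = r) against the human reference.
mu_rr, mu_hh = 10.0, 9.0
alpha = {
    ("r", "r"): 0.0, ("h", "h"): 0.0,    # no mismatch when target and probe match
    ("r", "h"): -0.5, ("h", "r"): -0.5,  # assumed symmetric mismatch attenuation
}

m_on_rhesus_probe = expected_log_ratio(mu_rr, mu_hh, alpha[("r", "r")], alpha[("h", "r")])
m_on_human_probe = expected_log_ratio(mu_rr, mu_hh, alpha[("r", "h")], alpha[("h", "h")])
print(m_on_rhesus_probe, m_on_human_probe)  # 1.5 0.5
```

In this toy case neither probe alone returns the assumed fold change of 1.0, but the mean of the two does, because the two attenuation differences cancel when the attenuation is symmetric.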
A direct application of the lowess normalization procedure to log ratios for probes
of species $p$ only will generally result in biased estimates of expression levels. This
procedure would lead to a line centered at the local mean (across genes) of $\alpha_{tp} - \alpha_{hp}$,
the difference of log attenuation factors, which is not zero in general. Carrying out a similar
normalization on all probes of the four species together would (at best) lead to a line
centered at the local mean of
$$\tfrac{1}{4}\left[(\alpha_{tt} - \alpha_{ht}) + (\alpha_{th} - \alpha_{hh}) + (\alpha_{tp'} - \alpha_{hp'}) + (\alpha_{tp''} - \alpha_{hp''})\right]$$
$$= \tfrac{1}{4}\left[(\alpha_{th} - \alpha_{ht}) + (\alpha_{tp'} - \alpha_{hp'}) + (\alpha_{tp''} - \alpha_{hp''})\right],$$
where $p'$ and $p''$ are the two probe species that are neither human nor the target species $t$,
and the second line uses $\alpha_{tt} = \alpha_{hh} = 0$ (no attenuation when target and probe species coincide).
This again is not expected to be zero. However, looking at this expression we see that if
we make the reasonable assumption that $\alpha_{th} = \alpha_{ht}$, a standard procedure applied to the log
ratios $M_{tp}$ for $p = t$ and $p = h$ should lead to an approximately unbiased normalization
curve. This is what was implemented in Gilad et al. (2005) and is appropriate if not too
many of the log fold changes $\delta_t$ are non-zero, or about half are positive and half negative (see Yang et al.,
2002, for more discussion). We applied this procedure and the resulting adjustment was
applied to log ratios for all probes. An example of an array hybridized with rhesus as the
target species against the human reference is shown below. The red spots are the
unnormalized rhesus probes with the red line a lowess fit through those probes. The black
dots are the unnormalized human probes with the black line a lowess fit through those
probes. A lowess fit using both sets of probes results in the blue line which is what we
use to adjust all the probes on the array.
Figure 1: Log-ratios (M) vs log intensities (A) for a rhesus macaque/human hybridization. The
rhesus probes are shown in red with the red line a lowess curve through these probes. The human
probes are shown in black with the black line a lowess curve through these probes and the blue line is
the lowess curve through both sets of probes used in the normalization.
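The normalization step described above (fit one curve through the target-species and human probes, then subtract it from every probe on the array) might be sketched as follows. This is not the authors' code: a plain tricube-weighted local linear smoother without robustness iterations stands in for lowess, and the spot fields `A`, `M`, `probe`, the `frac` value and the synthetic offsets are all assumptions made for illustration.

```python
def local_linear_smoother(x, y, frac=0.5):
    """Return a function giving a tricube-weighted local linear fit of y on x.

    A simplified stand-in for lowess (no robustness iterations).
    """
    k = max(2, int(round(frac * len(x))))  # number of neighbours per local fit

    def predict(x0):
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] * 1.0001 + 1e-12  # bandwidth just beyond the k-th neighbour
        w = [(1.0 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        return my + slope * (x0 - mx)

    return predict


def normalize_array(spots, target, frac=0.5):
    """Fit the curve on target-species and human probes only, then adjust all probes."""
    fit_spots = [s for s in spots if s["probe"] in (target, "h")]
    curve = local_linear_smoother(
        [s["A"] for s in fit_spots], [s["M"] for s in fit_spots], frac
    )
    return [dict(s, M_norm=s["M"] - curve(s["A"])) for s in spots]


# Synthetic rhesus-vs-human array: a linear dye bias plus probe-specific offsets.
OFFSETS = {"r": 0.5, "h": -0.5, "c": 0.2, "o": 0.3}  # hypothetical values
spots = [
    {"A": float(a), "probe": p, "M": (0.1 * a - 1.0) + off}
    for a in range(6, 15)
    for p, off in OFFSETS.items()
]
normalized = normalize_array(spots, target="r")
chimp_probe_values = [s["M_norm"] for s in normalized if s["probe"] == "c"]
```

Because the rhesus and human offsets in this synthetic array are equal and opposite, the fitted curve tracks the dye bias alone, and the chimpanzee-probe values come back at their offset of 0.2 after adjustment.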
Linear Modeling
The fixed effects in the linear model (equation 0.3) describe the expression level $\delta_t$ of the
target species $t$ relative to human $h$ and the expression attenuation $\alpha_{tp}$ caused by the target
and probe sequence mismatch. After normalization we performed a check of additivity by
examining the residuals for several genes after fitting the fixed effects using ordinary
least squares. We stratified the residuals by target species and by probe species and we
did not find any outstanding deviations from the model. Examples of these plots are
shown in Figure 2 for four random genes.
There are also a number of error levels in the experiment which could be included
in the model and we used analysis of variance to investigate the relative contribution of
error levels. For each gene we have 320 measurements arising from 4 species × 5
individuals × 4 arrays × 4 probes. We initially estimated the fixed effects by ordinary
least squares. Next we analyzed the residuals by estimating the variance components for a
model with random effects for individuals, arrays within individuals and error, that is
$$r_{tijp} = \beta_0 + \eta_{ti} + \gamma_{tij} + \varepsilon_{tijp} \qquad (0.4)$$
where $r_{tijp}$ are the residuals of the measurements after subtracting the fixed effects and $\beta_0$
is the intercept term. The effects for individuals, $\eta_{ti}$, are assumed to be uncorrelated with
mean zero and variance $\sigma^2_\eta$, and the effects $\gamma_{tij}$, for arrays within individuals, are
assumed to be uncorrelated with mean zero and variance $\sigma^2_\gamma$. Finally, the residual errors
$\varepsilon_{tijp}$ are assumed to be uncorrelated with mean zero and variance $\sigma^2_\varepsilon$. All of these
analyses are gene specific but here we have suppressed the gene labels. The mean squares
and variance components were estimated by analysis of variance for each gene.
Figure 2: Examination of the residuals of four random genes on the array. Each row represents a
different gene. The first panel in each row plots the residuals for the 320 observations against the
fitted values, and the four colors represent the four probe species. The second panels are boxplots of
the residuals stratified by target species, and the third panels are normal quantile-quantile plots.
Negative variance components were dealt with in the way outlined in Thompson (1962)
by setting such components to zero and recalculating the lower level component by
pooling the data from the two levels. We found that about 35% of the genes produced
estimates of $\hat{\sigma}^2_\gamma = 0$ (the array component) when calculated this way. We estimated the variance components for
each of the species separately, and for the species together, and these produced very
similar results. Boxplots of the three variance components for all the genes for each of the
species separately are shown in Figure 3. It can be seen that the term for the arrays ($\hat{\sigma}^2_\gamma$)
is substantially smaller than the individual and error terms and therefore we did not keep
this term in the model. We could further split off a probe term $\phi_{tip}$ from the residual error
term $\varepsilon_{tijp}$ as in the following model
$$r_{tijp} = \beta_0 + \eta_{ti} + \gamma_{tij} + \phi_{tip} + \varepsilon_{tijp}. \qquad (0.5)$$
The variance estimates from this model give a similar picture to the previous model with
the error and individual terms dominating and the array and probe terms substantially
smaller (data not shown). We concluded that neither the array nor probe terms were
warranted and therefore fitted a model with only random effect terms for individual and
error as outlined in the paper.
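The variance-component estimation for the nested model with individual, array and error terms can be illustrated with the sketch below, which uses the usual balanced nested analysis-of-variance estimators. It is one reading of the procedure, not the authors' implementation: the data layout `data[i][j][p]`, the function name, and the way a negative array component is zeroed and pooled into the error term are assumptions, and the degrees of freedom ignore those spent on fitting the fixed effects.

```python
def mean(values):
    return sum(values) / len(values)


def variance_components(data):
    """data[i][j][p]: residual for individual i, array j, probe p (balanced design)."""
    a, b, n = len(data), len(data[0]), len(data[0][0])
    arr_means = [[mean(arr) for arr in ind] for ind in data]
    ind_means = [mean(m) for m in arr_means]
    grand = mean(ind_means)

    # Sums of squares for individuals, arrays within individuals, and error.
    ss_ind = b * n * sum((m - grand) ** 2 for m in ind_means)
    ss_arr = n * sum(
        (arr_means[i][j] - ind_means[i]) ** 2 for i in range(a) for j in range(b)
    )
    ss_err = sum(
        (y - arr_means[i][j]) ** 2
        for i in range(a) for j in range(b) for y in data[i][j]
    )
    df_ind, df_arr, df_err = a - 1, a * (b - 1), a * b * (n - 1)
    ms_ind, ms_arr, ms_err = ss_ind / df_ind, ss_arr / df_arr, ss_err / df_err

    var_err = ms_err
    var_arr = (ms_arr - ms_err) / n
    var_ind = (ms_ind - ms_arr) / (b * n)
    if var_arr < 0:
        # Zero the array component and pool the array and error levels.
        var_arr = 0.0
        var_err = (ss_arr + ss_err) / (df_arr + df_err)
        var_ind = (ms_ind - var_err) / (b * n)
    return {"individual": max(var_ind, 0.0), "array": var_arr, "error": var_err}


# Tiny hypothetical example: 2 individuals x 2 arrays x 2 probes.
toy = [[[1.0, 3.0], [2.0, 4.0]], [[5.0, 7.0], [6.0, 8.0]]]
components = variance_components(toy)
```

In the toy data the array mean square (1.0) is smaller than the error mean square (2.0), so the array component is set to zero and the error variance is re-estimated from the pooled sums of squares.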
Figure 3: Boxplots of log variance components of all the genes for the three random effects from
equation 0.4. For each species the component for the arrays is significantly smaller than the other
two terms and about 35% are equal to zero and are not shown in the plot.
Inferring ancestral state to determine changes in human and chimp lineages. We chose genes
that showed significant differences between human and chimpanzee using the likelihood ratio
tests on pairs of species. From these, we restricted ourselves to genes which were not
significantly different between orangutan and rhesus. There were 84 genes satisfying these
criteria. We used the mean of the orangutan and rhesus expression as the expression of the
common primate ancestor (a in Figure 4). Changes can occur in any of the three branches connecting
the species. The amount of change in expression in each branch is calculated by inferring the
expression relative to the most recent common ancestor of human and chimpanzees denoted by
x following Rossnes et al. (2005). If the expression of the common primate ancestor, a, is
between the expression of human, h, and chimpanzee, c, then the expression at x is deemed to
be equal to that at the common ancestor, otherwise, the log expression at x is half way between
the expression at the ancestor a and the closer of the human and chimpanzee values. More than
two-thirds of the genes have the expression at the ancestor a between the expression of human and
chimpanzee. Once this is inferred, we can calculate the log expression change in the branches
connecting the ancestor, human and chimpanzee.
Figure 4: Changes in gene expression can occur in any of the 3 branches connecting human (h),
chimpanzee (c) and their common ancestor (a). The expression at a is calculated as the mean of the
Orangutan and Rhesus expression after selecting genes for which these are not differentially
expressed. The amount of change in relative expression in each branch is calculated by inferring the
relative expression at x (see methods). If the expression of the outgroup, a, is between h and c then x
= a, otherwise, x is half way between the expression of a and the closer of h and c.
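The rule for placing x and computing the branch changes can be written out as a short sketch. The function names and the sign convention (descendant minus ancestor) are assumptions made for illustration; the input values are hypothetical log expression levels.

```python
def infer_x(h, c, a):
    """Log expression at the human-chimpanzee common ancestor x."""
    if min(h, c) <= a <= max(h, c):
        return a  # ancestor lies between human and chimpanzee: x = a
    closer = h if abs(h - a) < abs(c - a) else c
    return (a + closer) / 2.0  # half way between a and the closer species


def branch_changes(h, c, o, r):
    """Changes on the three branches, taking a as the orangutan/rhesus mean."""
    a = (o + r) / 2.0
    x = infer_x(h, c, a)
    return {"a": a, "x": x, "human": h - x, "chimpanzee": c - x, "ancestral": x - a}


outside = branch_changes(h=3.0, c=1.0, o=0.0, r=0.0)   # a = 0 lies below both h and c
between = branch_changes(h=2.0, c=-1.0, o=0.5, r=-0.5)  # a = 0 lies between h and c
print(outside["x"], outside["human"], outside["chimpanzee"])  # 0.5 2.5 0.5
print(between["x"], between["human"], between["chimpanzee"])  # 0.0 2.0 -1.0
```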
Search for genes involved in cancer. In order to identify genes that are associated with
human cancer, we performed an automated search (in October 2005) of the Descriptions,
OMIM Disorder, and Protein Domains and Families fields of each gene entry in
GeneCards database for the following strings (also as part of a word): cancer,
carcinoma, lymphoma, leukemia, malignant, tumor. This search yielded 53 genes.
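A string search of this kind might look like the sketch below. It does not use any GeneCards interface: each gene entry is assumed to be available locally as a dictionary keyed by field name, the example entries are invented, and case-insensitive matching is an assumption, since the text does not say how case was handled.

```python
CANCER_TERMS = ("cancer", "carcinoma", "lymphoma", "leukemia", "malignant", "tumor")
FIELDS = ("Descriptions", "OMIM Disorder", "Protein Domains and Families")


def is_cancer_associated(entry):
    """True if any search string occurs (also as part of a word) in the chosen fields."""
    text = " ".join(entry.get(field, "") for field in FIELDS).lower()
    return any(term in text for term in CANCER_TERMS)


# Invented entries for illustration only.
entries = {
    "GENE_A": {"Descriptions": "Putative tumor suppressor", "OMIM Disorder": ""},
    "GENE_B": {"Descriptions": "Liver enzyme", "Protein Domains and Families": "Oxidoreductase"},
    "GENE_C": {"OMIM Disorder": "Hepatocellular Carcinoma, susceptibility to"},
}
hits = sorted(name for name, entry in entries.items() if is_cancer_associated(entry))
print(hits)  # ['GENE_A', 'GENE_C']
```

Substring matching means that a term such as "tumor" also matches "tumorigenesis", which follows the "also as part of a word" rule stated above.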