Supplementary Statistics

advertisement
Supplementary Statistics
1) Statistical analysis of chromosome distribution in exponentially growing cells.
Results in the text are for the 108 interactions that occurred in the untreated sample
and were flanked by MspI restriction sites. Tests were also performed for all
interactions (123); and interactions flanked by restriction sites but eliminating
chromosome 10, where for the treated case a large number of interactions without
flanking sites suggested possible registration problems. We note where these different
interaction sets lead to different conclusions. Although chromosome 10 does not have
any interactions for the untreated case, tests of how the interactions are distributed
across the genome are affected by elminating chromosome 10.
Chromosomes were concatenated and normalized to length 1; the positions of the
interactions were converted to this scale.
The Kolomgorov-Smirnov test for unifomity was used to detect any large scale
differences from the null hypothesis that the positions of the interactions are
uniformly distributed across the genome (ie large regions with an elevated rate of
interactions). We fail to reject the null hypothesis (p=0.149).
The Kolomgorov Smirnov test for an exponential distribution of inter-event distances,
using the observed mean of inter-event distances, was used to test for an excess of
long or short inter-event distances. We fail to reject the null hypothesis (p=0.265).
This test was significant (p=0.006) when all interactions (i.e. MspI flanked and
unflanked) were included, showing an excess of very short and very long distances.
A Chi-Squared goodness of fit test was used to detect small scale clumping of the
interactions. The Poisson process implies a Poisson distribution for the number of
observations in an interval of fixed width. The genome was divided into 50 intervals
of equal length. The null hypothesis for the test is that the number of interactions in
each interval should be Poisson with mean 108/50=2.16. The test requires that, under
the null hypothesis, the number of intervals with 0, 1, 2, 3, or ≥4 interactions exceeds
5, otherwise further combining of the categories is required. The division of the
genome into 50 intervals was chosen as a convenient number of intervals that met this
requirement. The observed and expected number of intervals, with the given number
of interactions, are shown in the table below.
Number of
hits
Expected
no. of
intervals
with this
many hits
Observed
no. of
intervals
with this
many hits
0
1
2
3
>=4
5.77
12.46
13.45
9.69
8.64
12
7
16
5
10
The Chi-squared statistic (sum over cells of [(observed-expected)2/ expected]) is
12.09, to be compared to a Chi-squared distribution with 4 degrees of freedom. The pvalue is 0.0167, so the null hypothesis is rejected.
We see there is an excess of intervals with zero hits, and of cells with >=4 hits. Thus,
when examined on a fine scale, the genome has excess regions without interactions,
and excess regions of dense interactions compared to a Poisson process. This test is
not significant (p=0.734) when Chromosome 10 is eliminated, as many of the excess
intervals with zero hits are eliminated.
Disregarding the 16 interactions that overlap inter and intra genic regions, we consider
the null hypothesis that the number of interactions within open reading frames is
proportional to the amount of the genome within open reading frames, 0.72 [1]. Again,
this would be the expectation if the positions of the interactions followed a Poisson
process. This null hypothesis is rejected (p=7.7 x 10-6; z-test for single proportion,
with null value 0.72, with continuity correction, as implemented in R function
“prop.test”). If the 16 interactions on the gene borders are included as inter genic, ther
result is merely suggestive (p=0.097). Confidence intervals for the two cases are
(0.858-0.973) and (0.706, 0.865) respectively. In the case where all interactions are
included, the result when the overlapping interactions are pooled with the intergenic
ones is significant (p=0.028).
2) Statistical analysis of phenanthroline treated samples.
The phenanthroline data set of 136 interactions with flanking restriction sites was
analysed as for the exponential growth interaction set. In this case, the full set of
interactions had 153 events, and the set eliminating chromosome 10 had 132 events,
but none of the test conclusions changed.
The Kolmogorov-Smirnov test of uniformity failed to reject the null hypothesis
(p=0.52).
The Kolmogorov-Smirnov test of exponential inter-event distances, using observed
mean 80772 bp, was significant (p=0.001). There is an excess of very short distances.
The Chi-Squared goodness of fit test (this time testing against a Poisson distribution
with parameter as 136/80=1.70), was non significant (Chi-Squared statistic 6.11,
p=0.19), indicating small scale clustering is not a major feature of this data set. The
table of observed and expected values is given below.
Number of
hits
Expected
no. of
intervals
with this
many hits
Observed
no. of
intervals
with this
many hits
0
1
2
3
>=4
14.61
24.84
21.12
11.97
7.46
17
30
17
6
10
Disregarding the interactions that overlap inter and intra genic regions, a test of the
null hypothesis that the proportion of hits that are intra genic = 0.72 gives p= 1.1 x 105
.
Including the 13 gene border interactions in the inter-genic group, the p-value is 0.024
(z-test for single proportion, with null value 0.72, with continuity correction, as
implemented in R function “prop.test”). The corresponding confidence intervals for
the proportion of intragenic interactions are (0.832, 0.946) and (0.732,0.870). Note
these intervals have substantial overlap with those obtained for the exponentially
growing yeast, indicating the two conditions yield similar proportions of intra-genic
interactions.
Below are the results of an analysis which included interactions on chromosome 10
and those which were not directly flanked by MspI sites. There were only minor
differences between these two analyses.
3) Statistical analysis of chromosome distribution in exponentially growing cells.
Chromosomes were concatenated and normalized to length 1; the positions of the 123
interactions were converted to this scale.
The Kolomgorov-Smirnov test was used to detect any large scale differences from the
null hypothesis that the positions of the interactions are uniformly distributed across
the genome (ie large regions with an elevated rate of interactions). We fail to reject
the null hypothesis (p=0.19).
A Chi-Squared goodness of fit test was used to detect small scale clumping of the
interactions. The Poisson process implies a Poisson distribution for the number of
observations in an interval of fixed width. The genome was divided into 80 intervals
of equal length. The null hypothesis for the test is that the number of interactions in
each interval should be Poisson with mean 123/80=1.54. The test requires that, under
the null hypothesis, the number of intervals with 0, 1, 2, 3, or ≥4 interactions exceeds
5, otherwise further combining of the categories is required. The division of the
genome into 80 intervals was chosen as a convenient number of intervals that met this
requirement. The observed and expected number of intervals, with the given number
of interactions, are shown in the table below.
Number of
hits
Expected
no. of
intervals
with this
many hits
Observed
no. of
intervals
with this
many hits
0
1
2
3
>=4
17.19
26.43
20.32
10.41
5.63
31
17
14
9
9
The Chi-squared statistic (sum over cells of [(observed-expected)2/ expected]) is
18.62, to be compared to a Chi-squared distribution with 4 degrees of freedom. The pvalue is 0.001, so the null hypothesis is rejected.
We see there is an excess of intervals with zero hits, and of cells with >=4 hits. There
is actually one cell on chromosome 12 that has 17 hits, which is exceedingly unlikely
in a Poisson distribution with mean 1.54. Thus, when examined on a fine scale, the
genome has excess regions without interactions, and excess regions of dense
interactions compared to a Poisson process.
Disregarding the 16 interactions that overlap inter and intra genic regions, we consider
the null hypothesis that the number of interactions within open reading frames is
proportional to the amount of the genome within open reading frames, 0.72 [6]. Again,
this would be the expectation if the positions of the interactions followed a Poisson
process. This null hypothesis is rejected (p=1.3 x 10-6; z-test for single proportion,
with null value 0.72, with continuity correction, as implemented in R function
“prop.test”). It is also rejected if the 16 interactions on the gene borders are included
as inter genic (p=0.028). Confidence intervals for the two cases are (0.865, 0.971) and
(0.731, 0.875) respectively.
4) Statistical analysis of phenanthroline treated samples.
The phenanthroline data set was analysed as for the exponential growth interaction
set.
The Kolmogorov-Smirnov test gave a p-value of less than 1.0 x10-15, indicating there
are differences in the rate of interactions over different (large scale) genome regions.
The Chi-Squared goodness of fit test (this time testing against a Poisson distribution
with parameter as 153/80=1.91), was non significant (Chi-Squared statistic 6.44,
p=0.17), indicating small scale clustering is not a major feature of this data set. The
table of observed and expected values is given below.
Number of
hits
Expected
no. of
intervals
with this
many hits
Observed
no. of
intervals
with this
many hits
0
1
2
3
>=4
11.85
22.63
21.61
13.76
10.16
14
29
18
7
12
Disregarding the interactions that overlap inter and intra genic regions, a test of the
null hypothesis that the proportion of hits that are intra genic = 0.72 gives p= 1.0 x 10 6
.
Including the gene border interactions in the inter-genic group, the p-value is 0.016 (ztest for single proportion, with null value 0.72, with continuity correction, as
implemented in R function “prop.test”). The corresponding confidence intervals for
the proportion of intragenic interactions are (0.848, 0.952) and (0.737, 0.867). Note
these intervals have substantial overlap with those obtained for the exponentially
growing yeast, indicating the two conditions yield similar proportions of intra-genic
interactions.
1.
Dujon B: The yeast genome project: what did we learn? Trends Genet
1996, 12(7):263-270.
Download