
Supplementary Materials and Methods
Computational analyses
The bindpat algorithm was developed to identify any patterns of a k-mer sequence in the
genome. It performs “exact set matching” for a segment of sequence on a given (longer)
DNA sequence (Gusfield, 1997). The consensus sequence may contain letters such as Y,
R, W, S (=G/C) and N (=A/C/G/T), which stand for more than one possible nucleotide.
Expressed in terms of nucleotides, the consensus sequence may be viewed as a set of
patterns; a set match is said to exist if any one of the patterns is a perfect match. For
example, the consensus sequence RRRCWWGYYY represents a set of $2^8 = 256$ possible
nucleotide patterns or segments.
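The set-matching idea can be illustrated with a minimal Python sketch. This is not the bindpat program itself: it assumes the standard IUPAC meanings of the degenerate letters and substitutes a regular-expression scan for the exact set matching algorithm, and the function names `expand_consensus` and `find_matches` are hypothetical.

```python
import re
from itertools import product

# Standard IUPAC codes for the letters used in the consensus (assumed subset).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT",
}

def expand_consensus(consensus):
    """Return every concrete nucleotide pattern the consensus stands for."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in consensus))]

def find_matches(consensus, sequence):
    """Return 0-based start positions where any pattern of the set matches exactly."""
    regex = "".join("[%s]" % IUPAC[c] for c in consensus)
    # A lookahead is used so that overlapping occurrences are also reported.
    return [m.start() for m in re.finditer("(?=%s)" % regex, sequence)]
```

For RRRCWWGYYY, `expand_consensus` enumerates the 256 patterns explicitly, whereas `find_matches` tests all of them in a single pass over the sequence.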
The spacer length was defined as the physical distance in base pairs between consecutive
identified segments, measured from the last base of one segment to the first base of the next.
The observed relative frequency (occurrence rate) of a specific spacer
length for each chromosome was calculated as the proportion of the spacers with the
given length out of all spacers found in that chromosome. The average observed relative
frequency of spacer length j was computed as $\frac{1}{C}\sum_{c} k_{jc}/N_c$, where $k_{jc}$ is the number of
spacers of j bp on chromosome c, $N_c$ is the total number of spacers on chromosome c, and
C is the total number of chromosomes examined. Expected spacer length frequencies for
each chromosome were calculated as follows: the geometric distribution was
approximated by the exponential density function $\lambda \exp(-\lambda j)$ (Ewens and Grant,
2001), where the parameter $\lambda = K/L$, K is the number of spacers found in the chromosome
of length L bp, and j is the spacer length in base pairs, j = 0, 1, …, (L-20). The expected
frequencies averaged over all chromosomes were then calculated. When the genome was
randomly shuffled, these calculated expected frequencies were in agreement with actual
observed numbers (data not shown).
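The spacer calculations described above may be sketched as follows. This is an illustrative reading rather than the original implementation: it assumes non-overlapping 10-bp segments given by their 0-based start positions, so that adjacent segments have a spacer of 0 bp, and all function names are hypothetical.

```python
import math
from collections import Counter

def spacer_lengths(starts, motif_len=10):
    """Bases between the end of one segment and the start of the next (assumed convention)."""
    starts = sorted(starts)
    return [b - (a + motif_len) for a, b in zip(starts, starts[1:])]

def observed_frequencies(spacers):
    """Relative frequency k_j / N of each spacer length j on one chromosome."""
    n = len(spacers)
    return {j: k / n for j, k in Counter(spacers).items()}

def average_frequency(j, per_chromosome_spacers):
    """Average over the C chromosomes of k_jc / N_c for spacer length j."""
    freqs = [observed_frequencies(s).get(j, 0.0) for s in per_chromosome_spacers]
    return sum(freqs) / len(freqs)

def expected_frequency(j, n_spacers, chrom_len):
    """Exponential approximation lambda * exp(-lambda * j) with lambda = K / L."""
    lam = n_spacers / chrom_len
    return lam * math.exp(-lam * j)
```

Because K is much smaller than L, the expected frequency decays only slowly over the first few thousand base pairs, so the exponential curve is nearly flat across the spacer lengths examined.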
The expected number of p53 10-mer DNA binding motifs in the human genome is
approximately $3 \times 10^9 \times (1/4)^2 \times (1/2)^8$ = 732,422, and the motif actually appears 750,231
times, so the p-value based on the binomial distribution B (Robin et al., 2002) is
$\Pr\{B[3 \times 10^9, (1/4)^2 \times (1/2)^8] \geq 750{,}231\} = 10^{-95}$. Given that the total number of p53 DNA
binding motifs (non-overlapping) found in the genome is fixed, $\chi^2$ tests of 1 degree of
freedom were performed on each of the 10,001 spacer lengths (0 to 10,000 bp), [no. of
observed relative frequencies – no. of expected]$^2$/[no. of expected]. A Bonferroni
adjustment for multiple comparisons was then applied by multiplying all the p-values by
$2.7 \times 10^9$, the length of the human genome.
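The arithmetic of this paragraph can be reproduced with a short Python sketch. It assumes the test statistic is applied to observed and expected counts, uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x/2)), and caps adjusted p-values at 1; the function names are hypothetical.

```python
import math

def expected_motif_count(genome_len=3e9):
    """Two fixed positions (C, G) and eight two-fold degenerate positions."""
    return genome_len * (1 / 4) ** 2 * (1 / 2) ** 8

def chi_square_1df(observed, expected):
    """Return (statistic, p-value) for a chi-square test with 1 degree of freedom."""
    stat = (observed - expected) ** 2 / expected
    # For 1 d.f., P(X > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

def bonferroni(p, factor):
    """Multiply a p-value by the correction factor, capped at 1 (the cap is an assumption)."""
    return min(1.0, p * factor)
```

Calling `expected_motif_count()` gives 732,421.875, which rounds to the 732,422 quoted above; `bonferroni(p, 2.7e9)` applies the genome-length factor used in the text.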
Wavelet shrinkage was employed to smooth the discontinuities (de-noise) in the
frequency functions. The theoretical underpinning of wavelet shrinkage was developed
by Donoho and Johnstone (Donoho and Johnstone, 1995) and has been incorporated in
the S+Wavelets Module (StatSci Division of MathSoft). In our application of the p53
signals, the median of absolute deviation from the median was used to compute the noise
scale and the number of levels to be shrunk was chosen to be 8.
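For readers without access to S+Wavelets, the general procedure can be sketched in plain Python. The wavelet family and thresholding rule used in the original analysis are not stated, so the Haar wavelet, soft thresholding, and the universal threshold below are assumptions made purely for illustration; only the median-absolute-deviation noise scale and the 8 levels come from the text.

```python
import math
from statistics import median

def haar_forward(x, levels):
    """Multi-level Haar transform; len(x) must be divisible by 2**levels."""
    approx, details = list(x), []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details  # details[0] holds the finest scale

def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest level down."""
    for d in reversed(details):
        out = []
        for s, t in zip(approx, d):
            out += [(s + t) / math.sqrt(2), (s - t) / math.sqrt(2)]
        approx = out
    return approx

def mad_noise_scale(coeffs):
    """Median absolute deviation from the median, rescaled for Gaussian noise."""
    m = median(coeffs)
    return median(abs(c - m) for c in coeffs) / 0.6745

def soft(c, t):
    """Soft thresholding: shrink the coefficient towards zero by t."""
    return math.copysign(max(abs(c) - t, 0.0), c)

def shrink(x, levels=8):
    """De-noise x by shrinking detail coefficients at all `levels` scales."""
    approx, details = haar_forward(x, levels)
    sigma = mad_noise_scale(details[0])          # noise scale from the finest level
    t = sigma * math.sqrt(2 * math.log(len(x)))  # universal threshold (assumption)
    details = [[soft(c, t) for c in d] for d in details]
    return haar_inverse(approx, details)
```

With `levels=8` the input length must be a multiple of 256, so a frequency function over 10,001 spacer lengths would need padding or truncation before use.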
We downloaded the human genome sequence (NCBI Build 36.1) from
http://genome.ucsc.edu/.
The p53REs with long spacers are distributed unevenly across the genome. Extreme
examples are the 24-bp spacers on an amplified portion of chromosome 20 with a very
high occurrence rate of 66 copies found among a total of 17,264 spacers on this
chromosome (0.382% vs. 0.027%, observed vs. expected relative frequencies,
respectively) and the 129 bp spacers on chromosome 17 with a very low relative
frequency of 0.0098% (0.023% expected). The highest occurrence of the peaks for all six
spacer lengths resides on chromosomes 20 and Y (occurrence per length of the
chromosome) and the lowest frequencies are found in chromosomes 19 and 22.
Sequence comparisons
The high-frequency spacers including the flanking p53RE half-site sequences (20 base
pairs) were obtained from the bindpat program and were BLAST’ed against each other
using the BLASTn program. The subset of those sequences sharing at least 80% identity
were then BLAST’ed against the available consensus sequences for known repeating
subfamilies (RepBase7.40 - RepBase14.04, ref. (Jurka, 2000)) as well as a survey of
retrovirus and retrotransposon genes. Each of the five spacer sizes in p53REs (14, 24, 129,
1895, and 3056 bp) was compared for at least 80% identity to the consensus
sequences for 640 repeat elements. Percent identity is calculated as the total number of
matched bases divided by the total length of the match in base pairs. Only matches
covering the entire length of the spacer and p53RE (10 base pairs flanking the spacer)
were considered for the calculation of percent identity.
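The percent identity definition can be written out as a small Python sketch. It assumes the two sequences are supplied already aligned, with `-` marking gaps, and that full coverage means the aligned query contains at least the spacer length plus the 20 flanking bases; the function name and the `required_len` parameter are hypothetical.

```python
def percent_identity(query, subject, required_len=None):
    """Matched bases divided by the length of the match, as a percentage."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    # Reject matches that do not span the whole spacer plus flanking half-sites.
    if required_len is not None and len(query.replace("-", "")) < required_len:
        return None
    matches = sum(q == s and q != "-" for q, s in zip(query, subject))
    return 100.0 * matches / len(query)
```

For a 14-bp spacer, `required_len` would be 34 under this reading, and a returned value of at least 80.0 would meet the identity criterion used here.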
Sequence similarity measures were computed via $\pi = \frac{n}{n-1}\sum_{i,j} x_i x_j \pi_{ij}$, where $x_i$ and $x_j$ are
the proportions of total number of sequences of type i and j, and $\pi_{ij}$ is the proportion of
nucleotides that differ between the two types of sequences. Thus, a lower $\pi$ value
indicates a higher sequence similarity, and vice versa.
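A minimal Python sketch of this measure is given below. It assumes that n is the total number of sequences compared, that the sum runs over all ordered pairs of distinct sequence types, and that the sequences are of equal length and already aligned; the function names are hypothetical.

```python
from collections import Counter

def pairwise_difference(a, b):
    """Proportion of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def sequence_diversity(sequences):
    """n / (n - 1) times the sum over type pairs (i, j) of x_i * x_j * pi_ij."""
    n = len(sequences)
    counts = Counter(sequences)  # one entry per distinct sequence type
    total = 0.0
    for si, ci in counts.items():
        for sj, cj in counts.items():
            total += (ci / n) * (cj / n) * pairwise_difference(si, sj)
    return n / (n - 1) * total
```

Under these assumptions the result equals the average proportion of differing nucleotides over all pairs of sequences, so a set of identical sequences gives 0.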
Evolution of L1
For the past 30 million years, there have been successive waves of L1 replication in the
human genome and in the ancestors to the human genome. In each case the wave of
amplification represented a single dominant subfamily, distinguishable by characteristic
polymorphisms. For unknown reasons, each wave has subsided, giving way to a new
wave of integrations from a new dominant subfamily, derived by specific substitutions
from the previous subfamily. Amplification of the youngest subfamily of L1 began 2.5 to
4 million years ago, well after the human-chimpanzee split, and is therefore
human-specific. However, other waves preceded the primate radiation, and so we compared
consensus sequences of the six youngest L1 subfamilies (Khan et al., 2006), dating back
40 million years, to determine the presence or absence of the common p53-binding sites.
References
Donoho, DL, Johnstone, IM (1995). Adapting to unknown smoothness via wavelet
shrinkage. Journal of the American Statistical Association 90: 1200-1224.
Ewens, WJ, Grant, G (2001). Statistical methods in bioinformatics: an introduction.
Springer: New York.
Gusfield, D (1997). Algorithms on strings, trees, and sequences: computer science and
computational biology. Cambridge University Press: Cambridge.
Jurka, J (2000). Repbase update: a database and an electronic journal of repetitive
elements. Trends Genet 16: 418-420.
Khan, H, Smit, A, Boissinot, S (2006). Molecular evolution and tempo of amplification
of human LINE-1 retrotransposons since the origin of primates. Genome Res 16: 78-87.
Robin, S, Daudin, JJ, Richard, H, Sagot, MF, Schbath, S (2002). Occurrence probability
of structured motifs in random sequences. J Comput Biol 9: 761-773.