Suzanne, please add this sentence: Linear, blunt

Supplementary Materials and Methods

Computational analyses

The bindpat algorithm was developed to identify all occurrences of a k-mer sequence pattern in the genome. It performs "exact set matching" for a segment of sequence on a given (longer) DNA sequence (Gusfield, 1997). The consensus sequence may contain letters such as Y, R, W, S (= G/C) and N (= A/C/G/T), which stand for more than one possible nucleotide. Expressed in terms of nucleotides, the consensus sequence may be viewed as a set of patterns; a set match is said to exist if any one of the patterns is a perfect match. For example, the consensus sequence RRRCWWGYYY represents a set of $2^8 = 256$ possible nucleotide patterns or segments.

The spacer length was defined as the physical distance in base pairs between consecutive identified segments, measured from the last base of one segment to the first base of the next. The observed relative frequency (occurrence rate) of a specific spacer length for each chromosome was calculated as the proportion of spacers with the given length out of all spacers found in that chromosome. The average observed relative frequency of spacer length $j$ was computed as $\frac{1}{C}\sum_{c} k_{jc}/N_c$, where $k_{jc}$ is the number of spacers of $j$ bp on chromosome $c$, $N_c$ is the total number of spacers on chromosome $c$, and $C$ is the total number of chromosomes examined.

Expected spacer length frequencies for each chromosome were calculated as follows: the geometric distribution was approximated by the exponential density function $\lambda \exp(-\lambda j)$ (Ewens and Grant, 2001), where the parameter $\lambda = K/L$, $K$ is the number of spacers found in the chromosome of length $L$ bp, and $j$ is the spacer length in base pairs, $j = 0, 1, \ldots, (L-20)$. The expected frequencies averaged over all chromosomes were then calculated. When the genome was randomly shuffled, these calculated expected frequencies were in agreement with actual observed numbers (data not shown).
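The matching and spacer calculations described above can be illustrated with a minimal Python sketch. This is not the bindpat implementation: it assumes a regular-expression scan of the forward strand only, it reports non-overlapping matches, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# IUPAC codes mentioned in the text, mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]", "N": "[ACGT]",
}
# Number of nucleotides each ambiguous code stands for.
DEGENERACY = {"R": 2, "Y": 2, "W": 2, "S": 2, "N": 4}


def pattern_count(consensus):
    """Number of concrete nucleotide patterns a consensus represents."""
    count = 1
    for ch in consensus:
        count *= DEGENERACY.get(ch, 1)
    return count


def find_sites(sequence, consensus):
    """Start positions of non-overlapping matches (assumption: forward strand only)."""
    regex = re.compile("".join(IUPAC[ch] for ch in consensus))
    return [m.start() for m in regex.finditer(sequence)]


def spacer_lengths(starts, k):
    """Bases between the last base of one site and the first base of the next."""
    return [b - (a + k) for a, b in zip(starts, starts[1:])]


def observed_frequencies(spacers):
    """Relative frequency of each spacer length among all spacers."""
    total = len(spacers)
    return {j: n / total for j, n in Counter(spacers).items()}


def expected_frequency(j, num_spacers, chrom_length):
    """Exponential approximation lambda * exp(-lambda * j), with lambda = K / L."""
    lam = num_spacers / chrom_length
    return lam * math.exp(-lam * j)


# Toy sequence: two sites matching RRRCWWGYYY separated by a 2-bp spacer.
toy = "AGACATGTCT" + "TT" + "GGGCTAGCCC"
starts = find_sites(toy, "RRRCWWGYYY")
spacers = spacer_lengths(starts, k=10)
```

On the toy sequence the two sites start at positions 0 and 12, giving a single spacer of 2 bp. Averaging over chromosomes would then amount to taking the mean of the per-chromosome observed frequencies for each spacer length.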
The expected number of p53 10-mer DNA binding motifs in the human genome is approximately $3 \times 10^9 \times (1/4)^2 \times (1/2)^8 = 732,422$, and the motif actually appears 750,231 times, so the p-value based on the binomial distribution $B$ (Robin et al., 2002) is $\Pr\{B[3 \times 10^9, (1/4)^2 \times (1/2)^8] \geq 750,231\} = 10^{-95}$. Given that the total number of p53 DNA binding motifs (non-overlapping) found in the genome is fixed, $\chi^2$ tests of 1 degree of freedom were performed on each of the 10,001 spacer lengths (0 to 10,000 bp), using the statistic [no. of observed relative frequencies - no. of expected]$^2$/[no. of expected]. A Bonferroni adjustment for multiple comparisons was then applied by multiplying all the p-values by $2.7 \times 10^9$, the length of the human genome.

Wavelet shrinkage was employed to smooth the discontinuities (de-noise) in the frequency functions. The theoretical underpinning of wavelet shrinkage was developed by Donoho and Johnstone (Donoho and Johnstone, 1995) and has been incorporated in the S+Wavelets Module (StatSci Division of MathSoft). In our application to the p53 signals, the median absolute deviation from the median was used to compute the noise scale, and the number of levels to be shrunk was chosen to be 8.

We downloaded the genome sequence for the human genome, NCBI Build 36.1, from http://genome.ucsc.edu/.

The p53REs with long spacers are distributed unevenly across the genome. Extreme examples are the 24-bp spacers on an amplified portion of chromosome 20, with a very high occurrence rate of 66 copies among a total of 17,264 spacers on this chromosome (0.382% vs. 0.027%, observed vs. expected relative frequencies, respectively), and the 129-bp spacers on chromosome 17, with a very low relative frequency of 0.0098% (0.023% expected). The highest occurrence of the peaks for all six spacer lengths resides on chromosomes 20 and Y (occurrence per length of the chromosome), and the lowest frequencies are found on chromosomes 19 and 22.
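The expected motif count and the per-spacer test described above can be sketched in Python as follows. The sketch assumes the chi-square statistic is applied to spacer counts (the expected count for the chromosome 20 example is derived here from the reported 0.027% expected frequency), and the function names are hypothetical. The 1-degree-of-freedom tail probability uses the identity P(X > x) = erfc(sqrt(x/2)).

```python
import math


def expected_motif_count(genome_length, motif_probability):
    """Expected occurrences of a motif in a random sequence of the given length."""
    return genome_length * motif_probability


def chi_square_1df(observed, expected):
    """Chi-square statistic and upper-tail p-value for 1 degree of freedom."""
    stat = (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def bonferroni(p_value, n_comparisons):
    """Bonferroni adjustment: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p_value * n_comparisons)


# RRRCWWGYYY: two fixed positions (1/4 each) and eight two-fold positions (1/2 each).
motif_probability = (1 / 4) ** 2 * (1 / 2) ** 8
expected_sites = expected_motif_count(3e9, motif_probability)

# Chromosome 20, 24-bp spacers: 66 observed among 17,264 spacers.
# Assumption: expected count = 0.027% of the spacers on this chromosome.
expected_24bp = 0.00027 * 17264
stat, p_raw = chi_square_1df(66, expected_24bp)
p_adjusted = bonferroni(p_raw, 2.7e9)
```

With these inputs the expected genome-wide count rounds to 732,422, matching the figure in the text, and the chromosome 20 excess remains significant after the adjustment by the factor of $2.7 \times 10^9$ used above.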
Sequence comparisons

The high-frequency spacers, including the flanking p53RE half-site sequences (20 base pairs), were obtained from the bindpat program and were compared against each other using the BLASTn program. The subset of those sequences sharing at least 80% identity was then compared, again using BLAST, against the available consensus sequences for known repeat subfamilies (RepBase 7.40 to RepBase 14.04; Jurka, 2000) as well as a survey of retrovirus and retrotransposon genes. Each of the five sizes of spacer regions in p53REs (14, 24, 129, 1895, and 3056 bp) was compared for at least 80% identity to the consensus sequences for 640 repeat elements. Percent identity is calculated as the total number of matched bases divided by the total length of the match in base pairs. Only matches covering the entire length of the spacer and p53RE (10 base pairs flanking the spacer) were considered for the calculation of percent identity.

Sequence similarity measures were computed via $\pi = \frac{n}{n-1} \sum_{i,j} x_i x_j \pi_{ij}$, where $x_i$ and $x_j$ are the proportions of the total number of sequences of type $i$ and $j$, and $\pi_{ij}$ is the proportion of nucleotides that differ between the two types of sequences. Thus, a lower $\pi$ value indicates a higher sequence similarity, and vice versa.

Evolution of L1

For the past 30 million years, there have been successive waves of L1 replication in the human genome and in the ancestors of the human genome. In each case the wave of amplification represented a single dominant subfamily, distinguishable by characteristic polymorphisms. For unknown reasons, each wave has subsided, giving way to a new wave of integrations from a new dominant subfamily, derived by specific substitutions from the previous subfamily. Amplification of the youngest subfamily of L1 began 2.5 to 4 million years ago, well after the human-chimpanzee split, and is therefore human-specific.
However, other waves preceded the primate radiation, and so we compared consensus sequences of the six youngest L1 subfamilies (Khan et al., 2006), dating back 40 million years, to determine the presence or absence of the common p53-binding sites.

References

Donoho, DL, Johnstone, IM (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association 90: 1200-1224.

Ewens, WJ, Grant, G (2001). Statistical methods in bioinformatics: an introduction. Springer: New York.

Gusfield, D (1997). Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press: Cambridge.

Jurka, J (2000). Repbase update: a database and an electronic journal of repetitive elements. Trends Genet 16: 418-420.

Khan, H, Smit, A, Boissinot, S (2006). Molecular evolution and tempo of amplification of human LINE-1 retrotransposons since the origin of primates. Genome Res 16: 78-87.

Robin, S, Daudin, JJ, Richard, H, Sagot, MF, Schbath, S (2002). Occurrence probability of structured motifs in random sequences. J Comput Biol 9: 761-773.
