Practice Exercises (SOLUTIONS)

The following exercises are designed primarily for programming practice in R. However, we will motivate the problems by considering the following:

   1) A researcher has identified a genetic structure that she believes is conserved throughout the genome. In order to determine the probability that this structure arose by chance, she generates many random sequences of the same length, with the marginal probability of each nucleotide set to its empirical frequency.

   2) A researcher is studying promoter regions that are rich in guanine and, from a list of candidate promoters, wants to look at all sequences where guanine content is greater than 30%.

1) Generating random sequences

   a. Create a function that generates a single random nucleotide X where P(X = "G") = 0.30, P(X = "A") = 0.20, P(X = "C") = 0.25, and P(X = "T") = 0.25.
      Hint: You may want to use the runif() function to do this.

         generateSingleN = function() {
           u = runif(1)                # one uniform draw on (0, 1)
           if (u < .30) return ("G")   # probability 0.30
           if (u < .50) return ("A")   # probability 0.20
           if (u < .75) return ("C")   # probability 0.25
           return ("T")                # remaining probability 0.25
         }

   b. Using the function you have created in (a), create another function that generates a random nucleotide sequence of length n.

         generateN = function(n) {
           s = character(n)            # empty character vector of length n
           for (i in seq_len(n)) {
             s[i] = generateSingleN()
           }
           return (s)
         }

   c. Generate a random nucleotide sequence of length 100 using the sample() function, where the probability of each nucleotide is given in (a).
      Hint: type '?sample' for more information.

         dna = c("G", "A", "C", "T")
         sample(dna, 100, replace = TRUE, prob = c(.3, .2, .25, .25))

2) Sequence analysis

   a. Load the object 'sequences' using the following command:

         load(url("http://www.public.iastate.edu/~gdancik/summer2007/files/sequences.RData"))

      to get a data.frame of DNA sequences, which has the name 'sequences'. Each column contains a 40-base nucleotide sequence. For example, sequences[,1] will return the first sequence (as a factor).

   b. Since the columns of sequences are factors, summary(sequences) will tell you the number of each nucleotide in each column. However, suppose that we did not know this. Modify the countN1 or countN2 functions to take a single sequence and return a vector of 4 elements that correspond to the number of A's, G's, C's, and T's in the sequence. (Note: you will need to remove the toupper() function, since we are now working with factors, and not characters.)

         countN2 = function(x) {
           numA = length( x[ x == "A"] )
           numG = length( x[ x == "G"] )
           numC = length( x[ x == "C"] )
           numT = length( x[ x == "T"] )
           return ( c(numA, numG, numC, numT) )   # order: A, G, C, T
         }

   c. Use the apply function to return a 4 x 10 matrix with the number of A's, G's, C's, and T's in each of the 10 sequences.

         apply(sequences, 2, countN2)
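
The second motivating problem (keeping only sequences with guanine content greater than 30%) is not solved in the exercises above, but the same apply() pattern can be adapted for it. The following is a minimal sketch, not part of the original solutions: the helper guanineFraction and the small data.frame toy are hypothetical names introduced here for illustration, and toy merely stands in for 'sequences' so that the example runs without downloading anything.

```r
# Hypothetical helper (an assumption, not from the original exercises):
# fraction of positions in a single sequence that are "G".
guanineFraction = function(x) {
  mean(x == "G")
}

# Toy stand-in for 'sequences': one sequence per column, stored as factors.
toy = data.frame(
  s1 = c("G", "G", "G", "G", "A", "C", "T", "A", "C", "T"),  # 4 of 10 are G
  s2 = c("A", "A", "C", "C", "T", "T", "G", "A", "C", "T"),  # 1 of 10 is G
  s3 = c("G", "G", "A", "A", "A", "C", "C", "T", "T", "A"),  # 2 of 10 are G
  stringsAsFactors = TRUE
)

# Guanine fraction for each column (sequence).
gFrac = apply(toy, 2, guanineFraction)

# Keep only the columns whose guanine content exceeds 30%.
# drop = FALSE keeps the result a data.frame even if one column remains.
richInG = toy[, gFrac > 0.30, drop = FALSE]
```

With the real data, replacing toy by sequences should give the guanine-rich candidates in the same way, assuming 'sequences' has one sequence per column as described in 2(a).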