RExercisesSol544

Practice Exercises (SOLUTIONS)
The following exercises are designed primarily for programming practice in R. However,
we will motivate the problems by considering the following:
1) A researcher has identified a genetic structure that she believes is conserved
throughout the genome. To estimate the probability that this structure arose by
chance, she generates many random sequences of the same length, drawing each
nucleotide independently according to its empirical frequency.
2) A researcher is studying promoter regions that are rich in guanine, and, from a list
of candidate promoters, wants to look at all sequences where guanine content is
greater than 30%.
1) Generating random sequences –
a. Create a function that generates a single random nucleotide X where
P(X = “G”) = 0.30, P(X = “A”) = 0.20, P(X = “C”) = 0.25, and
P(X = “T”) = 0.25
Hint: You may want to use the runif() function to do this.
generateSingleN = function() {
  u = runif(1)                  # uniform draw on [0, 1)
  if (u < .30) return ("G")     # P(G) = 0.30
  if (u < .50) return ("A")     # P(A) = 0.20
  if (u < .75) return ("C")     # P(C) = 0.25
  return ("T")                  # P(T) = 0.25
}
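As a quick sanity check on (a), one can draw many nucleotides and compare the observed proportions with the target probabilities. The sketch below repeats the function so that it runs on its own; the seed and the number of draws (10000) are arbitrary choices, not part of the exercise.

```r
# Same sampler as in (a), repeated so this block is self-contained.
generateSingleN = function() {
  u = runif(1)
  if (u < .30) return("G")
  if (u < .50) return("A")
  if (u < .75) return("C")
  return("T")
}

set.seed(1)                                   # arbitrary seed, for reproducibility only
draws = replicate(10000, generateSingleN())   # 10000 independent draws

# Observed proportions, reported in the order G, A, C, T
freq = table(factor(draws, levels = c("G", "A", "C", "T"))) / length(draws)
print(round(freq, 3))
```

The proportions should land close to 0.30, 0.20, 0.25 and 0.25, with small deviations due to sampling noise.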
b. Using the function you have created in (a), create another function that
generates a random nucleotide sequence of length n.
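generateN = function(n) {
  s = character(n)            # preallocate a character vector of length n
  for (i in seq_len(n)) {     # seq_len(n) is empty when n = 0, unlike 1:n
    s[i] = generateSingleN()
  }
  return (s)
}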
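The loop in (b) calls runif() once per nucleotide. A vectorized variant is possible: draw all n uniforms at once and map them to nucleotides using the same cut points (0.30, 0.50, 0.75). The sketch below is one way to do this; the name generateNvec is made up for illustration and does not appear in the exercise.

```r
# Vectorized alternative to the loop in (b); generateNvec is a hypothetical name.
generateNvec = function(n) {
  u = runif(n)                                   # n uniform draws at once
  # findInterval() counts how many cut points are <= u:
  #   u < .30 -> 0, .30 <= u < .50 -> 1, .50 <= u < .75 -> 2, u >= .75 -> 3
  idx = findInterval(u, c(.30, .50, .75)) + 1    # shift to indices 1..4
  c("G", "A", "C", "T")[idx]
}

set.seed(2)          # arbitrary seed
s = generateNvec(50)
print(s[1:10])
```

This follows the same thresholds as generateSingleN(), so the marginal probabilities are unchanged; only the looping is pushed into vectorized calls.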
c. Generate a random nucleotide sequence of length 100 using the
sample() function, where the probability of each nucleotide is given in (a).
Hint: type ‘?sample’ for more information.
dna = c("G", "A", "C", "T")
sample(dna, 100, replace = TRUE, prob = c(.3, .2, .25, .25))
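To connect this back to the first motivating problem, the sample() call can be repeated many times to estimate how often a given pattern shows up by chance. The sketch below is only an illustration: the motif "GGGG", the number of simulations (2000) and the seed are all hypothetical choices, not values from the exercise.

```r
set.seed(42)                        # arbitrary seed
dna   = c("G", "A", "C", "T")
probs = c(.3, .2, .25, .25)         # probabilities from (a)
motif = "GGGG"                      # hypothetical pattern of interest
nSim  = 2000                        # hypothetical number of random sequences

# For each simulated sequence, record whether the motif occurs at least once.
hasMotif = replicate(nSim, {
  s = paste(sample(dna, 100, replace = TRUE, prob = probs), collapse = "")
  grepl(motif, s, fixed = TRUE)
})

pHat = mean(hasMotif)               # Monte Carlo estimate of the chance probability
print(pHat)
```

The estimate pHat is the fraction of random sequences that contain the motif; more simulations would reduce its sampling error.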
2) Sequence analysis –
a. Load the object ‘sequences’ using the following command:
load(url("http://www.public.iastate.edu/~gdancik/summer2007/files/sequences.RData"))
to get a data.frame of DNA sequences named ‘sequences’. Each
column contains a 40-base nucleotide sequence. For example, sequences[,1]
will return the first sequence (as a factor).
b. Since the columns of sequences are factors, summary(sequences) will tell
you the number of each nucleotide in each column. However, suppose that we did
not know this. Modify the countN1 or countN2 functions to take a single
sequence, and return a vector of 4 elements that corresponds to the number of A’s,
G’s, C’s, and T’s in the sequence. (Note: you will need to remove the toupper()
function, since we are now working with factors, and not characters).
countN2 = function(x) {
  numA = length( x[ x == "A"] )
  numG = length( x[ x == "G"] )
  numC = length( x[ x == "C"] )
  numT = length( x[ x == "T"] )
  return ( c(numA, numG, numC, numT) )
}
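An alternative to subsetting four times is to tabulate the sequence once against a fixed set of levels. The sketch below assumes the same output order as above (A, G, C, T); the name countN3 is made up for illustration. One behavioural difference to be aware of: table() ignores NA values by default, whereas x[x == "A"] keeps NA entries and so would count them.

```r
# Tabulation-based counter; countN3 is a hypothetical name, not from the exercise.
countN3 = function(x) {
  # Fixing the levels guarantees the order A, G, C, T and a 0 for absent bases.
  as.vector(table(factor(x, levels = c("A", "G", "C", "T"))))
}

x = factor(c("A", "G", "G", "T"))   # small made-up sequence with no C
counts = countN3(x)
print(counts)
```

Because the levels are set explicitly, the function accepts either a factor or a character vector.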
c. Use the apply function to return a 4 x 10 matrix with the number of A’s, G’s, C’s,
and T’s in each of the 10 sequences.
apply(sequences, 2, countN2)
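The count matrix also gives a direct route to the second motivating problem, selecting sequences whose guanine content exceeds 30%. Since the sketch below should run without a network connection, it builds a small stand-in data.frame (called toy here, a hypothetical name) with the same shape as ‘sequences’: 10 columns of 40 bases. The row names and the variables gContent and richG are likewise illustrative.

```r
set.seed(7)                                    # arbitrary seed
dna = c("G", "A", "C", "T")

# Stand-in for 'sequences': 10 columns, each a 40-base sequence stored as a factor.
toy = as.data.frame(
  replicate(10, sample(dna, 40, replace = TRUE, prob = c(.3, .2, .25, .25))),
  stringsAsFactors = TRUE
)

# Counting function from part (b), repeated so the block is self-contained.
countN2 = function(x) {
  c(length(x[x == "A"]), length(x[x == "G"]),
    length(x[x == "C"]), length(x[x == "T"]))
}

counts = apply(toy, 2, countN2)                # 4 x 10 matrix, one column per sequence
rownames(counts) = c("A", "G", "C", "T")       # label rows in countN2's output order

gContent = counts["G", ] / colSums(counts)     # fraction of G in each sequence
richG = toy[, gContent > 0.30, drop = FALSE]   # keep sequences with more than 30% G
print(round(gContent, 3))
```

Using drop = FALSE keeps richG a data.frame even when only one sequence (or none) passes the threshold.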