Spring 2008 - Stat C141 / Bioeng C141 - Statistics for Bioinformatics

Course Website: http://www.stat.berkeley.edu/users/hhuang/141C-2008.html
Section Website: http://www.stat.berkeley.edu/users/mgoldman
GSI Contact Info: Megan Goldman, mgoldman@stat.berkeley.edu
Office Hours: 342 Evans, M 10-11, Th 3-4, and by appointment

1 Pencil and Paper stuff

Odds

Say you’re told that the odds of a horse winning a race are 5:4 against. The “meaning” of this statement is that for every 4 chances the horse has of winning, there are 5 chances to lose. Let’s say that event A is that the horse wins, and P(A) = p. We can obtain the value of p by converting the odds into a fraction, i.e. use 5/4 rather than 5:4, and use the following identity:

$$\frac{1-p}{p} = \frac{5}{4}$$

From there, you can do the algebra: 4(1 − p) = 5p, so 9p = 4, and the horse has a 4/9 ≈ .4444 chance of winning the race.

Sometimes, you’ll be told the “odds for”: say, a horse has odds for winning of 3:2. In that case, take the reciprocal of the formula above:

$$\frac{p}{1-p} = \frac{3}{2}$$

Changes of Location and Scale

Consider some data, X. You’ve done some calculations and know that X has mean x̄ and variance v. Now suppose we want to change X in some way, either by adding a constant to all the values (change of location) or by multiplying all values by a constant (change of scale). How will this affect the mean and variance?

Data        X      X + a      bX
Mean        x̄      x̄ + a      bx̄
Variance    v      v          b²v

2 R stuff

Probabilities and such from known distributions

R has a lot of the commonly used distributions already built in. First, let’s look at a few functions useful for the normal distribution:

dnorm(x) will give you the density of the normal distribution at some value, x. This function has some additional arguments which are optional. The most commonly used “other” arguments are mean and sd. If you don’t set these arguments, R will assume a mean of 0 and sd of 1, better known as standard normal.
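To see those defaults in action, here is a minimal sketch (the variable names same_density and peak are made up for illustration). It checks that leaving out mean and sd gives the same density as spelling out the standard normal, and that the height at 0 matches the usual formula 1/sqrt(2π).

```r
# Leaving out mean and sd should give the standard normal density.
default_call  <- dnorm(0)
explicit_call <- dnorm(0, mean = 0, sd = 1)

# all.equal() compares numbers up to a small numerical tolerance.
same_density <- all.equal(default_call, explicit_call)

# The standard normal density at 0 is 1 / sqrt(2 * pi).
peak <- 1 / sqrt(2 * pi)

print(same_density)
print(c(default_call, peak))
```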
If you have, say, a mean of 3 and an sd of 1.4, and you want the density at 2.25, the correct function would be: dnorm(2.25, mean = 3, sd = 1.4).

pnorm(x) will give you the cumulative distribution, i.e., the area under the curve up until value x. The mean and sd can be set, same as above. Note that the cumulative distribution gives you, by default, the area to the left of your value of interest. Many times, you’ll want the area to the right. There are two ways to get this: 1 - pnorm(x) works, or there is also an argument, lower.tail, that can be set. lower.tail defaults to TRUE, so to get the upper tail, use something like pnorm(x, lower.tail = FALSE).

qnorm(p) will give you the value of the p-th quantile. So, qnorm(.25) will give you the value of X that has 25% of the distribution smaller than it. The lower.tail argument is also available here, if you want to start counting percentiles from the largest value down. Mean and sd are also available arguments.

rnorm(n) will give you n random values from the normal distribution. The mean and the sd are settable as arguments.

Of course, normal isn’t the only distribution in the statistical world. The conventions in R are to use d as the first letter for densities, p for cumulative distributions, q for quantiles, and r for random values. What follows depends on the distribution. Here are a few common ones:

Normal - norm
Lognormal - lnorm
Student’s t - t
F - f
χ² - chisq
Binomial - binom
Poisson - pois
Uniform - unif
Exponential - exp
Gamma - gamma
Beta - beta

Note that some of these distributions have additional required arguments, like degrees of freedom. See the help files for details!

Try a few for yourself:

You have a normal distribution with mean 5 and sd 2. What’s the chance of observing a value larger than 8? Smaller than 4? Between 2 and 3? The answers are 0.0668072, 0.3085375, and 0.09184805, respectively.

You have a χ² distribution with 3 degrees of freedom. What value does the variable take at the 10th percentile?
The 95th? The answers here are 0.5843744 and 7.814728.

A loopy example

The central limit theorem says that the mean of a large sample is approximately normally distributed, even if the individual observations are nowhere near normal (as long as they have finite variance). Consider an exponential distribution with rate 3. First, try generating 500 random observations from this distribution, and looking at a histogram of those values: hist(rexp(500, rate = 3))

Nowhere near normal, eh? Well, to convince ourselves that the central limit theorem really does work, let’s try taking the mean of 500 random exponentials. For good measure, let’s do this 200 times over...

> means <- NULL
> for(i in 1:200)
+ means <- c(means, mean(rexp(500, rate = 3)))
> hist(means)

Let’s break down more carefully what I just did:

means <- NULL - I’ve declared some variable, means, and set its value to NULL. A NULL value means there’s nothing there, at least to start. Note that this is NOT the same as 0. I’ll get to that in a few lines. I needed to declare this before I started the loop for reasons also explained below.

for(i in 1:200) - Execute the following code 200 times. In this case, there’s just one line of code I want repeated a bunch of times. If you have more than one line of code to repeat, enclose all of them inside a pair of curly brackets { }. The loop will execute everything inside the brackets, in order, then loop back to the top of the code inside the brackets and start again.

means <- c(means, mean(rexp(500, rate = 3))) - There’s a lot going on here. rexp(500, rate = 3) says I want 500 random observations from an exponential distribution with rate 3. mean(·) asks for the mean of those observations. Note I’m not storing the observations themselves anywhere... I only care about the mean, so that’s all I’m going to keep. c(means, mean...) says to take the variable means, and tack the value of the mean of the 500 observations on to the end. Here’s where the earlier declaration means <- NULL becomes important.
Note that I’m taking the means variable, adding one more value onto the end, and then calling the result means again! Using c(means, ...) assumes that the object means already exists. If I hadn’t declared the means variable before starting this loop, I’d have gotten an error message complaining that I’d called on a variable that R doesn’t know about yet. If I’d done means <- 0 at the top, I’d have 0 as the first entry and then the 200 different means after it, which would mess up my data. So, NULL was the right choice here.

hist(means) is outside the loop. Once I’ve gotten all 200 means, this asks for a histogram. This looks pretty normal, doesn’t it?

Superimposing a curve on top of your histogram

The exponential distribution is well known to have mean $1/\lambda$ and variance $1/\lambda^2$, where $\lambda$ is the rate parameter. So, the mean of 500 exponentials will have mean $1/\lambda$ and variance $1/(500\lambda^2)$. With $\lambda = 3$, let’s try taking a normal distribution with mean $1/3$ and variance $1/4500$ and laying it on top of the histogram we just drew.

> x <- seq(.28, .38, by = .001)
> y <- dnorm(x, mean = 1/3, sd = sqrt(1/4500))
> hist(means, freq = FALSE)
> points(x, y, type = "l")

In the first line, I’m creating a variable called x. It has a sequence of points from .28 to .38 in increments of .001. So, if you look at x, it will contain .28, .281, .282, and so on, up to .38. I chose these values based on a quick glance at my histogram and observing what the smallest and largest values seem to be.

The second line is taking those values of x and finding the height of the normal density with our given mean and variance for each of them. Note that since sd is an available argument, but not variance, I just took the square root of the variance.

The third line is calling the histogram again. I want to make sure that we’re using densities, not frequencies, on the y-axis. The normal curve I want to put on top of it is using densities, and we want the same scale.
I called it again because the following command, points, expects that the graphic you’re laying the points on was just called.

points(x, y, type = "l") is the command to put additional points on an existing graphic. The first argument is your x values, the second the y values. type = "l" says I want a line connecting these points (the default is to have an open circle at each point). Voilà! You have a curve on top of your histogram.

Booleans

The following is lifted from the “Introduction to R” tutorial (see my website for link.)

As well as numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values TRUE, FALSE, and NA (for “not available”, see below). The first two are often abbreviated as T and F, respectively. Note however that T and F are just variables which are set to TRUE and FALSE by default, but are not reserved words and hence can be overwritten by the user. Hence, you should always use TRUE and FALSE.

Logical vectors are generated by conditions. For example

> temp <- x > 13

sets temp as a vector of the same length as x with values FALSE corresponding to elements of x where the condition is not met and TRUE where it is.

The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In addition if c1 and c2 are logical expressions, then c1 & c2 is their intersection (“and”), c1 | c2 is their union (“or”), and !c1 is the negation of c1.

Logical vectors may be used in ordinary arithmetic, in which case they are coerced into numeric vectors, FALSE becoming 0 and TRUE becoming 1.

(end borrowing from tutorial)

To practice, let’s use something like Problem 7 in your current homework. We have A, C, G, T, where P(A) = .2, P(C) = .3, P(G) = .3, and P(T) = .2. Let’s create a vector of 50 observations, and see how many times we find AA.

> nuc <- c("A", "C", "G", "T")

This is creating a vector containing the four letters. Note that they’re in quotes.
If you use text that’s not in quotes, R is going to assume you’re looking for a variable that you’ve named A somewhere earlier.

> vec <- sample(nuc, 50, replace = TRUE, prob = c(.2, .3, .3, .2))

Now we’ve created a vector called vec. The sample function here has said we’re sampling from nuc, 50 times. replace = TRUE means we’re sampling with replacement. Note that, in this case, we couldn’t sample without replacement... we’d run out of items after 4! prob lets you set a vector, the same length as the vector you’re sampling from, containing the probability of observing each entry.

> Count <- 0
> for(i in 1:49)
+ ifelse(vec[i] == "A" & vec[i+1] == "A", Count <- Count + 1, Count <- Count)

Here, I’ve needed a variable called Count before I started my loop, and I set it to 0 to start. Since the first AA will increase the count to 1, the second to 2, and so on, we’re fine starting with 0.

Note that in the for() portion of the code, I’m using 1:49. Since we’re looking at pairs, if I tried to start a pair with the 50th observation, there wouldn’t be a second letter after it to look at.

ifelse() is the good old if-then-else formulation you’ve probably used in other coding. The “if” portion here is in Boolean form. vec[i] says to pick off the ith item in my vector, vec. The Boolean expression here will return TRUE if the ith entry is "A" AND the (i+1)th entry is "A". When it’s true, I want Count to increase by 1. If it’s not true, Count should stay the same.

Saving your code and reloading it later

When I have a project that I’m working on a bit at a time, or where I’ll be asked by the instructor to include my code, I save it as I’m going. I generally find it easier to have a separate file open. When R is open, under the File drop-down, there’s an option for New Document. Opening this will give you a new document (duh!). You can type code in this document and use “save as” to assign it a name and save it somewhere.
Also in the File drop-down, there are options for Open Document and Source File. Open Document will open a document for editing, if you want to add or change code. Source File will take the file and run its code in the R window. So, what I generally do is open the document, make whatever changes, save it, and then Source File. If there are more changes needed, I make them, save, source, and so on.

What’s in my workspace? And yow, is it a mess!

Can’t remember what you called that vector, or if you’ve even created it yet? The ls() command (no arguments needed) will show you everything in your workspace.

Got a messy workspace and want to clean it up a bit? rm() will remove a given object from your workspace. The argument for this is the name of the object to remove. Make sure you don’t actually need whatever you’re removing.

If you want to just clear everything off your workspace, say you’ve finished Problem 1 and don’t need any of it saved for use in Problem 2, in the “Workspace” drop-down there’s an option for Clear Workspace. It’ll ask if you’re sure, then completely clean off your workspace. There’s also a Show Workspace option, which may be a little friendlier than ls() and rm() for examining and cleaning your workspace.

Quitting R

Since I forgot this last time... to quit R, just type quit(). No arguments are needed, but you do need to have the open and close parentheses. You’ll get a box asking if you want to save your workspace; the usual answer is no.
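If you’d rather stay at the prompt than use the menus, the workspace commands above can be tried out with a couple of throwaway objects. This is only a sketch: scratch_a and scratch_b are made-up names, and the two commented-out lines are my assumption about the closest typed equivalents of the Clear Workspace menu item and of answering “no” when quitting.

```r
# Two throwaway objects, created only so there is something to list and remove.
scratch_a <- 1:10
scratch_b <- "temporary"

# ls() returns the names of the objects in the workspace as a character vector.
print(ls())

# rm() removes the named object; scratch_b is left alone.
rm(scratch_a)
still_there <- "scratch_a" %in% ls()
print(still_there)

# Removes every object that ls() reports. Make sure you need none of them!
# rm(list = ls())

# Quits without saving the workspace and without the pop-up question.
# quit(save = "no")
```

Calling ls() again after rm() is a cheap way to confirm that you removed what you meant to, and nothing else.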