Spring 2008 - Stat C141/ Bioeng C141 - Statistics for Bioinformatics
Course Website: http://www.stat.berkeley.edu/users/hhuang/141C-2008.html
Section Website: http://www.stat.berkeley.edu/users/mgoldman
GSI Contact Info:
Megan Goldman
mgoldman@stat.berkeley.edu
Office Hours: 342 Evans M 10-11, Th 3-4, and by appointment
1 Pencil and Paper stuff
Odds
Say you're told that the odds of a horse winning a race are 5:4 against. The "meaning"
of this statement is that for every 4 chances the horse has of winning, there are 5 chances to
lose. Let's say that event A is that the horse wins, and P(A) = p. We can obtain the value
of p by converting the odds into a fraction, i.e. use 5/4 rather than 5:4, and use the following
identity:

(1 − p) / p = 5/4

From there, you can do the algebra and learn that the horse has a 4/9 ≈ .4444 chance of
winning the race.
Sometimes, you'll be told the "odds for"; say, a horse has odds for winning of 3:2. In
that case, take the reciprocal of the formula above:

p / (1 − p) = 3/2
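If you want R to check the algebra, here's a quick sketch. The helper name odds_to_prob is mine, not a built-in R function.

```r
# Convert "odds against" of a:b into a probability of winning.
# Solving (1 - p) / p = a / b for p gives p = b / (a + b).
odds_to_prob <- function(a, b) {
  b / (a + b)
}

odds_to_prob(5, 4)  # 5:4 against -> 4/9, about .4444
odds_to_prob(2, 3)  # "odds for" of 3:2 is the same as 2:3 against -> 3/5
```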
Changes of Location and Scale
Consider some data, X. You’ve done some calculations and know that X has mean x̄
and variance v. Now suppose we want to change X in some way, either by adding a constant
to all the values (change of location) or by multiplying all values by a constant (change of
scale). How will this affect the mean and variance?
Data     Mean       Variance
X        x̄          v
X + a    x̄ + a      v
bX       bx̄         b²v
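You can check the table numerically on any data set; here's a sketch using made-up numbers.

```r
x <- c(2, 5, 7, 10)   # made-up data
a <- 3                # location shift
b <- 2                # scale factor

# Shifting by a moves the mean but leaves the variance alone:
mean(x + a) == mean(x) + a   # TRUE
var(x + a) == var(x)         # TRUE

# Scaling by b multiplies the mean by b and the variance by b^2:
mean(b * x) == b * mean(x)   # TRUE
var(b * x) == b^2 * var(x)   # TRUE
```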
2
R stuff
Probabilities and such from known distributions
R has a lot of the commonly used distributions already built in. First, let’s look at a few
functions useful for the normal distribution:
dnorm(x) will give you the density of the normal distribution at some value, x. This function has some additional arguments which are optional. The most commonly used "other"
arguments are mean and sd. If you don't set these arguments, R will assume a mean of 0 and
an sd of 1, better known as the standard normal. If you have, say, a mean of 3 and an sd of 1.4, and
you want the density at 2.25, the correct function would be: dnorm(2.25, mean = 3, sd = 1.4).
pnorm(x) will give you the cumulative distribution, i.e., the area under the curve up to the
value x. The mean and sd can be set, same as above. Note that the cumulative distribution
gives you, by default, the area to the left of your value of interest. Many times, you'll want
the area to the right. There are two ways to get this: 1 - pnorm(x) works, or there is also
an argument, lower.tail, that can be set. lower.tail defaults to TRUE, so to get the upper tail,
use something like pnorm(x, lower.tail = FALSE).
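For instance, both routes to the upper tail agree:

```r
# Area to the right of 8 for a normal with mean 5 and sd 2,
# computed two equivalent ways:
1 - pnorm(8, mean = 5, sd = 2)
pnorm(8, mean = 5, sd = 2, lower.tail = FALSE)
# Both give about 0.0668
```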
qnorm(p) will give you the value of the p-th quantile. So, qnorm(.25) will give you the
value of X that has 25% of the distribution smaller than it. The lower.tail argument is also
available here, if you want to start counting percentiles from the largest value down. Mean
and sd are also available arguments.
rnorm(n) will give you n random values from the normal distribution. The mean and the
SD are settable as arguments.
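To see the four functions side by side, here they are with the same mean 3 and sd 1.4 as above:

```r
dnorm(2.25, mean = 3, sd = 1.4)   # density at 2.25
pnorm(2.25, mean = 3, sd = 1.4)   # P(X <= 2.25)
qnorm(0.25, mean = 3, sd = 1.4)   # 25th percentile
rnorm(5, mean = 3, sd = 1.4)      # five random draws

# qnorm and pnorm are inverses of each other:
qnorm(pnorm(2.25, mean = 3, sd = 1.4), mean = 3, sd = 1.4)  # 2.25
```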
Of course, normal isn’t the only distribution in the statistical world. The conventions in
R are to use d as the first letter for densities, p for cumulative distributions, q for quantiles,
and r for random values. What follows depends on the distribution. Here are a few common
ones:
Normal - norm
Lognormal - lnorm
Student’s t - t
F - f
χ2 - chisq
Binomial - binom
Poisson - pois
Uniform - unif
Exponential - exp
Gamma - gamma
Beta - beta
Note that some of these distributions have additional required arguments, like degrees of
freedom. See the help files for details!
Try a few for yourself:
You have a normal distribution with mean 5 and sd 2. What’s the chance of observing
a value larger than 8? Smaller than 4? Between 2 and 3?
The answers are 0.0668072, 0.3085375, and 0.09184805, respectively.
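Here's one way to get those answers, using the ideas above:

```r
pnorm(8, mean = 5, sd = 2, lower.tail = FALSE)          # larger than 8: 0.0668072
pnorm(4, mean = 5, sd = 2)                              # smaller than 4: 0.3085375
pnorm(3, mean = 5, sd = 2) - pnorm(2, mean = 5, sd = 2) # between 2 and 3: 0.09184805
```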
You have a χ2 distribution with 3 degrees of freedom. What value does the variable take
at the 10th percentile? The 95th?
The answers here are 0.5843744 and 7.814728.
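These come from qchisq, with the degrees of freedom as a required argument:

```r
qchisq(0.10, df = 3)   # 10th percentile: 0.5843744
qchisq(0.95, df = 3)   # 95th percentile: 7.814728
```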
A loopy example
The central limit theorem says that the mean of a large sample is approximately normally distributed,
even if the variables themselves are nowhere near normal. Consider an exponential distribution with rate 3. First, try generating 500 random observations from this distribution, and
looking at a histogram of those values:
hist(rexp(500, rate = 3))
Nowhere near normal, eh? Well, to convince ourselves that the central limit theorem
really does work, let’s try taking the mean of 500 random exponentials. For good measure,
let’s do this 200 times over...
> means <- NULL
> for(i in 1:200)
+   means <- c(means, mean(rexp(500, rate = 3)))
> hist(means)
Let's break down more carefully what I just did:
means <- NULL - I've declared some variable, means, and set its value to NULL. A
NULL value means there's nothing there, at least to start. Note that this is NOT the same
as 0. I'll get to that in a few lines. I needed to declare this before I started the loop for
reasons also explained below.
for(i in 1:200) - Execute the following code 200 times. In this case, there’s just one line
of code I want repeated a bunch of times. If you have more than one line of code to repeat,
enclose all of them inside a pair of curly brackets { }. The loop will execute everything inside
the brackets, in order, then loop back to the top of code inside the brackets and start again.
means <- c(means, mean(rexp(500, rate = 3))) - There's a lot going on here. rexp(500,
rate = 3) says I want 500 random observations from an exponential distribution with rate
3. mean(·) asks for the mean of those observations. Note I'm not storing the observations
themselves anywhere... I only care about the mean, so that's all I'm going to keep. c(means,
mean...) says to take the variable means, and tack the value of the mean of the 500 observations
on to the end. Here's where the earlier declaration means <- NULL becomes important. Note
that I'm taking the means variable, adding one more value onto the end, and then calling it
means! Using c(means, ...) assumes that the object means already exists. If I hadn't declared
the means variable before starting this loop, I'd have gotten an error message complaining
that I'd called on a variable that R doesn't know about yet. If I'd done means <- 0 at the
top, I'd have 0 as the first entry and then the 200 different means after it, which would mess
up my data. So, NULL was the right choice here.
hist(means) is outside the loop. Once I’ve gotten all 200 means, this asks for a histogram.
This looks pretty normal, doesn’t it?
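For what it's worth, R also has a shortcut for this pattern: replicate() runs an expression a set number of times and collects the results, so the whole loop fits on one line. This is an alternative to the loop, not a replacement for learning it.

```r
# Same simulation as the loop: 200 means of 500 exponentials each
means <- replicate(200, mean(rexp(500, rate = 3)))
hist(means)
```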
Superimposing a curve on top of your histogram
The exponential distribution is well known to have mean 1/λ and variance 1/λ², where λ is
the rate parameter. So, the mean of 500 exponentials will have mean 1/λ and variance 1/(500λ²).
Let's try taking a normal distribution with mean 1/3 and variance 1/4500 and laying it on top of
the histogram we just drew.
> x <- seq(.28, .38, by = .001)
> y <- dnorm(x, mean = 1/3, sd = sqrt(1/4500))
> hist(means, freq = F)
> points(x, y, type = "l")
In the first line, I'm creating a variable called x. It has a sequence of points from .28 to .38
in increments of .001. So, if you look at x, it will contain .28, .281, .282, and so on, up to
.38. I chose these values based on a quick glance at my histogram and observing what the
smallest and largest values seem to be.
The second line is taking those values of x and finding the height of the normal
distribution with our given mean and variance for each of them. Note that since sd is an
available argument, but not variance, I just took the square root of the variance.
The third line is calling the histogram again. I want to make sure that we’re using
densities, not frequencies, on the y-axis. The normal curve I want to put on top of it is using
densities, and we want the same scale. I called it again because the following command,
points, expects that the graphic you’re laying the points on was just called.
points(x, y, type = "l") is the command to put additional points on an existing graphic.
The first argument is your x values, the second the y values. type = "l" says I want a line
connecting these points (the default is to have an open circle at each point). Voila! You
have a curve on top of your histogram.
Booleans
The following is lifted from the "Introduction to R" tutorial (see my website for the link).
As well as numerical vectors, R allows manipulation of logical quantities. The elements
of a logical vector can have the values TRUE, FALSE, and NA (for "not available", see
below). The first two are often abbreviated as T and F, respectively. Note however that T
and F are just variables which are set to TRUE and FALSE by default, but are not reserved
words and hence can be overwritten by the user. Hence, you should always use TRUE and
FALSE. Logical vectors are generated by conditions. For example
> temp <- x > 13
sets temp as a vector of the same length as x with values FALSE corresponding to
elements of x where the condition is not met and TRUE where it is.
The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In
addition if c1 and c2 are logical expressions, then c1 & c2 is their intersection (”and”), c1
| c2 is their union (”or”), and !c1 is the negation of c1. Logical vectors may be used in
ordinary arithmetic, in which case they are coerced into numeric vectors, FALSE becoming
0 and TRUE becoming 1.
(end borrowing from tutorial)
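The coercion to 0s and 1s is handy for counting: summing a logical vector counts how many entries are TRUE. A small sketch:

```r
x <- c(10, 15, 12, 20, 8)
temp <- x > 13    # FALSE TRUE FALSE TRUE FALSE
sum(temp)         # 2, the number of values above 13
mean(temp)        # 0.4, the proportion of values above 13
```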
To practice, let's use something like Problem 7 in your current homework. We have A,
C, G, T, where P(A) = .2, P(C) = .3, P(G) = .3, and P(T) = .2. Let's create a vector of 50
observations, and see how many times we find AA.
> nuc <- c("A", "C", "G", "T")
This creates a vector containing the four letters. Note that they're in quotes. If you
use text that's not in quotes, R is going to assume you're looking for a variable that you've
named A somewhere earlier.
> vec <- sample(nuc, 50, replace = T, prob = c(.2, .3, .3, .2))
Now we've created a vector called vec. The sample function here has said we're sampling
from nuc, 50 times. replace = T means we're sampling with replacement. Note that, in
this case, we couldn't sample without replacement... we'd run out of items after 4! prob
lets you set a vector, the same length as the vector you're sampling from, containing the
probability of observing each entry.
> Count <- 0
> for(i in 1:49)
+ ifelse(vec[i] == "A" & vec[i+1] == "A", Count <- Count + 1, Count <- Count)
Here, I've needed a variable called Count before I started my loop, and I set it to 0
to start. Since the first AA will increase the count to 1, the second to 2, and so on, we're
fine starting with 0. Note that in the for() portion of the code, I'm using 1:49. Since we're
looking at pairs, if I tried to start with the 50th observation, there wouldn't be a second
letter after it to look at.
ifelse() is the good old if-then-else formulation you've probably used in other coding. The
"if" portion here is in Boolean form. vec[i] says to pick off the ith item in my vector, vec.
The Boolean expression here will return TRUE if the ith entry is "A" AND the (i+1)th entry is
"A". When it's true, I want Count to increase by 1. If it's not true, Count should stay the same.
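Logical vectors also give a loop-free way to do the same count. vec[-50] drops the last entry and vec[-1] drops the first, lining each letter up with its successor; the & and sum() then count the AA pairs. This gives the same answer as the loop's Count.

```r
# vec is the vector of 50 letters sampled earlier
vec <- sample(c("A", "C", "G", "T"), 50, replace = TRUE,
              prob = c(.2, .3, .3, .2))

# Compare each position to the next one, then count the TRUEs
sum(vec[-50] == "A" & vec[-1] == "A")
```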
Saving your code and reloading it later
When I have a project that I’m working on a bit at a time, or where I’ll be asked by
the instructor to include my code, I save it as I’m going. I generally find it easier to have
a separate file open. When R is open, under the File drop-down, there’s an option for New
Document. Opening this will give you a new document (duh!). You can type code in this
document and use ”save as” to assign it a name and save it somewhere.
Also in the File drop-down, there are options for Open Document and Source File. Open
Document will open a document for editing, if you want to add or change code. Source File
will take the file and open it in the R window, running the code. So, what I generally do
is Open the document, make whatever changes, save it, and then Source File. If there are
more changes needed, I make them, save, source, and so on.
What’s in my workspace? And yow, is it a mess!
Can’t remember what you called that vector, or if you’ve even created it yet? The
ls() command (no arguments needed) will show you everything in your workspace. Got a
messy workspace and want to clean it up a bit? rm() will remove a given object from your
workspace. The argument for this is the name of the object to remove. Make sure you
don’t actually need whatever you’re removing. If you want to just clear everything off your
workspace, say you’ve finished Problem 1 and don’t need any of it saved for use in Problem 2,
in the ”Workspace” dropdown there’s an option for Clear Workspace. It’ll ask if you’re sure,
then completely clean off your workspace. There’s also a Show Workspace option, which
may be a little friendlier than ls() and rm() for examining and cleaning your workspace.
Quitting R
Since I forgot this last time... to quit R, just type quit(). No arguments are needed, but
you do need to have the open and close parentheses. You’ll get a box asking if you want to
save your workspace, the usual answer is no.