Sampling/For Loops/Recoding

CSSS 508: Intro to R
1/20/06
Lecture 3
Sampling/Loops/Recoding Variables
Sampling:
a) From your dataset:
You often have a matrix where each row is a subject that has been asked several
questions (the columns). If we want a random sample from our dataset, we use the
sample ( ) function to generate a vector of random row numbers and then select that
subset from the matrix.
For example, say our dataset (a matrix called data) has 64 rows and 12 columns.
We want to select 10 random people.
> random.sample<-sample(seq(1,nrow(data)),10)
> random.sample
[1] 52 2 25 54 27 8 40 11 49 17
> sample.subset<-data[random.sample,]
> dim(sample.subset)
[1] 10 12
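The steps above can be tried on a made-up dataset. This is only a sketch: the matrix below is a hypothetical stand-in filled with random numbers, and the seed value is arbitrary. It also relies on the fact that sample( ) draws without replacement by default, so no subject is selected twice.

```r
# Hypothetical stand-in for the dataset: 64 subjects (rows), 12 questions (columns)
set.seed(508)  # arbitrary seed so the same rows are drawn on every run
data <- matrix(rnorm(64 * 12), nrow = 64, ncol = 12)

# sample(n, k) picks k distinct numbers from 1:n (no replacement by default)
random.sample <- sample(nrow(data), 10)
sample.subset <- data[random.sample, ]

dim(sample.subset)            # 10 rows, 12 columns
anyDuplicated(random.sample)  # 0 means no row number was picked twice
```

To allow the same row to be picked more than once, add replace=TRUE to the sample( ) call.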
b) From a distribution:
Often we want to sample data from a specific distribution, also sometimes called
simulating data. This data is usually used to test some algorithm or function that
someone has written. Since the data is simulated, you know where it came from and so
what the answer should be from your algorithm or function. Simulated data lets you
double-check your work.
Each distribution has 4 functions associated with it: r--, d--, p--, and q--.
(For example, rnorm( ), dnorm( ), pnorm( ), and qnorm( )).
You must specify the parameters for the distribution.
The r-- function simulates random data from the distribution.
10 observations from a normal distribution with mean 0 and stdev 1:
> rnorm(10,0,1)
[1]  0.1872828 -0.1160541  1.8812873  1.2735349  1.0428904 -1.5879228
[7]  1.1122024 -0.3928728  0.7963980  1.1440760
6 observations from a normal distribution with mean 3 and stdev 2:
> rnorm(6,3,2)
[1] 3.618939 5.129774 2.553895 0.715022 7.506615 3.347618
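Because simulated data comes from a known distribution, it can be used to double-check a calculation. The sketch below assumes a larger sample size (10000) and an arbitrary seed, neither of which is from the lecture, and checks that the sample mean and standard deviation land near the values we asked for.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# 10000 observations from a normal with mean 3 and stdev 2
x <- rnorm(10000, mean = 3, sd = 2)

mean(x)  # should be close to 3
sd(x)    # should be close to 2
```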
Rebecca Nugent, Department of Statistics, U. of Washington
The d-- function finds the density value of the number/vector you plug in.
The density value of 3.5 in a normal with mean 3 and stdev 1:
> dnorm(3.5,3,1)
[1] 0.3520653
The density value of 0.5 in a normal with mean 3 and stdev 1:
> dnorm(0.5,3,1)
[1] 0.0175283
The density values of a vector in a normal with mean 0 and stdev 1:
> dnorm(c(-3,-2,-1,0,1,2,3),0,1)
[1] 0.004431848 0.053990967 0.241970725 0.398942280 0.241970725
0.053990967 0.004431848
The p-- function gives the distribution function; that is, it gives the probability of being
at or below the number/vector you plug in for that distribution. P(X <= x).
The probability of being <=3 in a normal with mean 0 and stdev 1:
> pnorm(3,0,1)
[1] 0.9986501
The probability of being <=-3 in a normal with mean 0 and stdev 1:
> pnorm(-3,0,1)
[1] 0.001349898
Note: you can get the probability of being greater than a number by using 1 - p--.
The probability of being >12 in a normal with mean 10 and stdev 3:
> 1-pnorm(12,10,3)
[1] 0.2524925
The q-- function gives the quantile function, or what number marks the x-th percentile of
a specific distribution. P(X <= ?) = given percentile.
The 50th percentile of a normal with mean 0 and stdev 1:
> qnorm(0.50,0,1)
[1] 0
The 10th, 25th, 75th, and 90th percentile of a normal with mean -1 and stdev 0.5:
> qnorm(c(0.10,0.25,0.75,0.90),-1,0.5)
[1] -1.6407758 -1.3372449 -0.6627551 -0.3592242
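The p-- and q-- functions undo each other, which gives a quick way to check your work. The sketch below reuses the normal with mean 10 and stdev 3 from the earlier example. The lower.tail argument is a standard option of pnorm( ) that returns P(X > x) directly.

```r
# P(X <= 12) for a normal with mean 10 and stdev 3
p <- pnorm(12, mean = 10, sd = 3)

# Feeding that probability back into qnorm() recovers 12
back <- qnorm(p, mean = 10, sd = 3)

# P(X > 12) without subtracting from 1
upper <- pnorm(12, mean = 10, sd = 3, lower.tail = FALSE)

c(p = p, back = back, upper = upper)
```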
There are R functions for the following distributions:
Beta, binomial, Cauchy, chi-squared, exponential, F, gamma, geometric, hypergeometric,
log-normal, logistic, negative binomial, normal, Poisson, student’s t, uniform, Weibull,
Wilcoxon.
(The table below lists their R names and the necessary parameters.)
Distribution         R Name     Additional Arguments
beta                 beta       shape1, shape2, ncp
binomial             binom      size, prob
Cauchy               cauchy     location, scale
chi-squared          chisq      df, ncp
exponential          exp        rate
F                    f          df1, df2, ncp
gamma                gamma      shape, scale
geometric            geom       prob
hypergeometric       hyper      m, n, k
log-normal           lnorm      meanlog, sdlog
logistic             logis      location, scale
negative binomial    nbinom     size, prob
normal               norm       mean, sd
Poisson              pois       lambda
Student’s t          t          df, ncp
uniform              unif       min, max
Weibull              weibull    shape, scale
Wilcoxon             wilcox     m, n
Examples of other distributions:
Flipping a coin: (Binomial distribution)
In rbinom( ), the n argument is how many times we repeat the experiment, size is how
many coins we flip each time, and p (short for prob) is the probability of getting heads:
Flip one coin once.
> rbinom(n=1,size=1,p=.5)
[1] 0
Flip three coins ten times; the results are the number of heads we saw each time.
> rbinom(n=10,size=3,p=.5)
[1] 2 1 1 2 0 3 2 1 3 3
We can change the coin to have a smaller chance of success.
> rbinom(n=10,size=3,p=.2)
[1] 2 1 0 1 1 0 0 1 1 1
What is the probability of seeing exactly 7 heads if we flip 12 coins?
> dbinom(7,12,.5)
[1] 0.1933594
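As a check, the same number can be computed by hand from the binomial formula, choose(12,7) * 0.5^7 * 0.5^5. The sketch below also shows one way to get the probability of at least 7 heads, which is a question the lecture does not ask but which follows the same pattern.

```r
# Binomial formula: (number of ways to pick 7 of 12) * P(heads)^7 * P(tails)^5
by.hand <- choose(12, 7) * 0.5^7 * 0.5^5
by.hand
dbinom(7, 12, 0.5)  # same value

# P(at least 7 heads): add up the probabilities of 7, 8, ..., 12 heads
at.least.7 <- sum(dbinom(7:12, 12, 0.5))
at.least.7
1 - pbinom(6, 12, 0.5)  # same value, using the distribution function
```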
Uniform distribution:
Random samples from [0,1]:
> runif(12,0,1)
[1] 0.69738202 0.15147387 0.60034879 0.70218089 0.19314468 0.19987450
[7] 0.28603845 0.08926752 0.33122900 0.28059597 0.10723647 0.38926535
or from the unit square: [0,1] by [0,1]:
> cbind(runif(8,0,1),runif(8,0,1))
          [,1]       [,2]
[1,] 0.8091289 0.02334477
[2,] 0.2980009 0.95909988
[3,] 0.9597524 0.56358745
[4,] 0.6610231 0.22847434
[5,] 0.1445462 0.82469317
[6,] 0.6264433 0.71810215
[7,] 0.9222504 0.40311884
[8,] 0.4051854 0.97956278
Poisson Distribution:
What is the probability of getting exactly 4 phone calls in the next hour if on average you
receive 6 phone calls an hour?
> dpois(4,6)
[1] 0.1338526
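The Poisson probability above can be verified from the formula exp(-lambda) * lambda^x / x!. As a small extension that is not part of the lecture, the sketch also uses ppois( ) to get the probability of 4 or fewer calls.

```r
lambda <- 6  # average number of calls per hour

# Poisson formula for exactly 4 calls
by.hand <- exp(-lambda) * lambda^4 / factorial(4)
by.hand
dpois(4, lambda)  # same value

# P(4 or fewer calls) is the sum of the probabilities of 0, 1, 2, 3, 4 calls
at.most.4 <- ppois(4, lambda)
at.most.4
sum(dpois(0:4, lambda))  # same value
```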
Exponential Distribution:
What is the probability that a light bulb with an average lifetime of 200 hours burns out
before 100 hours?
> pexp(100,1/200)
[1] 0.3934693
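Note that pexp( ) takes the rate, which is 1 divided by the average lifetime, not the average itself. A minimal sketch checking the answer against the formula 1 - exp(-rate * x), followed by a simulation check that assumes an arbitrary seed and sample size:

```r
rate <- 1 / 200  # average lifetime of 200 hours means rate = 1/200

# Distribution function of the exponential: P(X <= x) = 1 - exp(-rate * x)
by.hand <- 1 - exp(-rate * 100)
by.hand
pexp(100, rate)  # same value

# Simulated lifetimes should average roughly 200 hours
set.seed(2)  # arbitrary seed
lifetimes <- rexp(10000, rate)
mean(lifetimes)
```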
For loops:
Often we need to repeat an action several times – sometimes over subjects in a dataset.
> for(i in 1:n){
+ the action to be repeated
+ }
for indicates that we’re going to loop from a start index to an end index.
i is the index we’re looping over.
1 is our start index.
n is the end index.
{ opens the loop; } closes the loop.
(In the example below, the first line initiates the variable and assigns it the NULL value.)
> index<-NULL
> for(i in 1:4){
+ index<-c(index,i)
+ }
> index
[1] 1 2 3 4
Looping over a dataset:
> data<-cbind(rnorm(10,0,1),rnorm(10,3,1),rnorm(10,6,1),rnorm(10,9,1))
A different way of initializing the variables (fill them with zeros ahead of time):
> mean.vec<-sd.vec<-rep(0,4)
Looping over each column:
> for(i in 1:4){
+ mean.vec[i]<-mean(data[,i])
+ sd.vec[i]<-sd(data[,i])
+ }
> mean.vec
[1] 0.8071171 2.9904124 5.8173064 9.1275691
> sd.vec
[1] 1.0747551 0.9355267 0.6220746 0.7322954
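For column summaries like these, R also has built-in shortcuts that give the same answers without writing a loop. The sketch below regenerates the data with an arbitrary seed (so the numbers will differ from the output above) and compares the loop against colMeans( ) and apply( ).

```r
set.seed(3)  # arbitrary seed; values will differ from the lecture output
data <- cbind(rnorm(10, 0, 1), rnorm(10, 3, 1), rnorm(10, 6, 1), rnorm(10, 9, 1))

# The loop from the lecture
mean.vec <- sd.vec <- rep(0, 4)
for (i in 1:4) {
  mean.vec[i] <- mean(data[, i])
  sd.vec[i] <- sd(data[, i])
}

# Loop-free versions: colMeans() averages each column,
# apply(data, 2, sd) applies sd() to each column (2 = columns)
col.means <- colMeans(data)
col.sds <- apply(data, 2, sd)

all.equal(mean.vec, col.means)
all.equal(sd.vec, col.sds)
```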
You don’t have to loop over a sequence. It can be any vector of numbers.
> loop.vector<-c(3,5,7,12,20)
> for(i in loop.vector){
+ cat("i=",i,"\n")
+ }
i= 3
i= 5
i= 7
i= 12
i= 20
The cat( ) function prints its arguments in order. “\n” indicates a new line.
Can loop over a selected sample of rows in your dataset (here data is assumed to have at
least 40 rows):
> sample.vec<-sample(seq(1,40),10)
> sample.vec
[1] 31 28 18 1 8 20 25 11 30 29
> mean.vec<-rep(0,10)
> for(i in 1:length(sample.vec)){
+ mean.vec[i]<-mean(data[sample.vec[i],])
+ }
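A self-contained version of this loop is sketched below. The 40-row matrix is a hypothetical stand-in and the seed is arbitrary. The last lines compare the loop to rowMeans( ), which computes the same row averages in one step.

```r
set.seed(7)  # arbitrary seed
# Hypothetical dataset: 40 subjects (rows), 4 variables (columns)
data <- matrix(rnorm(40 * 4), nrow = 40, ncol = 4)

sample.vec <- sample(seq(1, 40), 10)  # 10 random row numbers

# Average of each sampled row, one row at a time
mean.vec <- rep(0, 10)
for (i in 1:length(sample.vec)) {
  mean.vec[i] <- mean(data[sample.vec[i], ])
}

# Same result without a loop
row.means <- rowMeans(data[sample.vec, ])
all.equal(mean.vec, row.means)
```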
While Loops:
If you’re not sure how many loops you need, you can use a while loop.
while( true/false statement) {
Repeated action
}
R checks the true/false statement; if true, it does another loop. If false, the loop stops.
> i<-1
> while(i <= 6){
+ cat("i=",i,"\n")
+ i<-i+1
+ }
i= 1
i= 2
i= 3
i= 4
i= 5
i= 6
We need an initializing statement to start the while loop: (i<-1). If we did not have the
i<-i+1 statement, i would always be 1, and the loop would be infinite.
> x<-rnorm(1,0,1)
> while(x<0){
+ cat(x)
+ x<-rnorm(1,0,1)
+ }
-0.572205-0.09479107
> x
[1] 0.5720156
> x<-rnorm(1,0,1)
> while(x<0.5){
+ cat(x)
+ x<-rnorm(1,0,1)
+ }
-1.043909-0.4560672-0.1760368-0.8520931-0.5805316
> x
[1] 1.865334
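In the output above the rejected values run together because cat(x) prints no separator between them. The sketch below is a variation on the first example, with an arbitrary seed and a hypothetical counter n.draws added: it prints each rejected value on its own line and counts how many draws were needed.

```r
set.seed(11)  # arbitrary seed

n.draws <- 1           # hypothetical counter: we have drawn once so far
x <- rnorm(1, 0, 1)
while (x < 0) {
  cat(x, "\n")         # "\n" puts each rejected value on its own line
  x <- rnorm(1, 0, 1)  # draw again
  n.draws <- n.draws + 1
}

x        # the first non-negative draw
n.draws  # how many draws it took, including the last one
```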
Recoding Variables:
Often we want to create another variable based on recoding a variable we already have.
For example, recoding a continuous variable as a categorical variable.
We can do this with a for loop.
> age.cat<-rep(0,n)
> for(i in 1:n){
+ if(age[i]>10 & age[i]<=20) age.cat[i]<-1
+ if(age[i]>20 & age[i]<=30) age.cat[i]<-2
+ if(age[i]>30 & age[i]<=40) age.cat[i]<-3
+ etc
+ }
> gen.race<-NULL
> for(i in 1:n){
+ if(gender[i]=="m"&(race[i]==1|race[i]==2)) gen.race<-c(gen.race,1)
+ if(gender[i]=="f"&(race[i]==1|race[i]==2)) gen.race<-c(gen.race,2)
+ if(gender[i]=="m"&(race[i]==3|race[i]==4)) gen.race<-c(gen.race,3)
+ if(gender[i]=="f"&(race[i]==3|race[i]==4)) gen.race<-c(gen.race,4)
+ }
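One thing to watch for when building gen.race with c( ): if a subject matches none of the conditions, nothing is appended, and gen.race ends up shorter than the dataset and out of line with the subjects. A possible alternative, sketched here with made-up gender and race values, is to fill the vector with NA ahead of time and assign by position.

```r
# Made-up data; the last subject has race 5, which no condition covers
gender <- c("m", "f", "m", "f", "m")
race <- c(1, 3, 4, 2, 5)
n <- length(gender)

gen.race <- rep(NA, n)  # NA marks "not yet coded"
for (i in 1:n) {
  if (gender[i] == "m" & (race[i] == 1 | race[i] == 2)) gen.race[i] <- 1
  if (gender[i] == "f" & (race[i] == 1 | race[i] == 2)) gen.race[i] <- 2
  if (gender[i] == "m" & (race[i] == 3 | race[i] == 4)) gen.race[i] <- 3
  if (gender[i] == "f" & (race[i] == 3 | race[i] == 4)) gen.race[i] <- 4
}

gen.race          # one entry per subject; the uncoded subject stays NA
length(gen.race)  # always equal to n
```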
We can also do this with a conditional statement inside the square brackets, which recodes
every matching element at once without a loop.
(Need to initialize the space ahead of time – can’t do age.cat<-NULL.)
> age.cat<-rep(0,n)
> age.cat[age>10 & age<=20]<-1
> age.cat[age>20 & age<=30]<-2
> age.cat[age>30 & age<=40]<-3
etc
> race.cat<-rep(0,n)
> race.cat[race==1|race==2]<-1
> race.cat[race==3|race==4]<-2
etc
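For equal-width categories like the age groups, R's built-in cut( ) function may save some typing. This is an alternative not covered in the lecture; the ages below are made up. With its default settings cut( ) uses intervals that are open on the left and closed on the right, which matches the >10 & <=20 style used above, and labels=FALSE returns the category numbers.

```r
age <- c(12, 20, 25, 30, 38)  # made-up ages

# Recoding with conditional statements, as in the lecture
age.cat <- rep(0, length(age))
age.cat[age > 10 & age <= 20] <- 1
age.cat[age > 20 & age <= 30] <- 2
age.cat[age > 30 & age <= 40] <- 3

# Same recoding with cut(): breaks mark the category boundaries
age.cut <- cut(age, breaks = c(10, 20, 30, 40), labels = FALSE)

age.cat
age.cut
```

One difference to keep in mind: ages outside the breaks are left as 0 by the first method but become NA with cut( ).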