CSSS 508: Intro to R
1/20/06
Lecture 3: Sampling/Loops/Recoding Variables
Rebecca Nugent, Department of Statistics, U. of Washington

Sampling:

a) From your dataset:

You often have a matrix where each row is a subject that has been asked several questions (the columns). If we want a random sample from our dataset, we use the sample( ) function to generate a vector of random row numbers and then select that subset from the matrix.

For example, say the below dataset had 64 rows and 12 columns. We want to select 10 random people.

> random.sample<-sample(seq(1,nrow(data)),10)
> random.sample
[1] 52 2 25 54 27 8 40 11 49 17
> sample.subset<-data[random.sample,]
> dim(sample.subset)
[1] 10 12

b) From a distribution:

Often we want to sample data from a specific distribution, also sometimes called simulating data. This data is usually used to test some algorithm or function that someone has written. Since the data is simulated, you know where it came from and so what the answer should be from your algorithm or function. Simulated data lets you double-check your work.

Each distribution has 4 functions associated with it: r--, d--, p--, and q--. (For example, rnorm( ), dnorm( ), pnorm( ), and qnorm( )). You must specify the parameters for the distribution.

The r-- function simulates random data from the distribution.

10 observations from a normal distribution with mean 0 and stdev 1:
> rnorm(10,0,1)
[1] 0.1872828 -0.1160541 1.8812873 0.7963980 1.0428904 -1.5879228
[7] 1.1122024 -0.3928728 1.2735349 1.1440760

6 observations from a normal distribution with mean 3 and stdev 2:
> rnorm(6,3,2)
[1] 3.618939 5.129774 2.553895 0.715022 7.506615 3.347618

The d-- function finds the density value of the number/vector you plug in.
The density value of 3.5 in a normal with mean 3 and stdev 1:
> dnorm(3.5,3,1)
[1] 0.3520653

The density value of 0.5 in a normal with mean 3 and stdev 1:
> dnorm(0.5,3,1)
[1] 0.0175283

The density values of a vector in a normal with mean 0 and stdev 1:
> dnorm(c(-3,-2,-1,0,1,2,3),0,1)
[1] 0.004431848 0.053990967 0.241970725 0.398942280 0.241970725 0.053990967
[7] 0.004431848

The p-- function gives the distribution function; that is, it gives the probability of being at or below the number/vector you plug in for that distribution: P(X <= x).

The probability of being <=3 in a normal with mean 0 and stdev 1:
> pnorm(3,0,1)
[1] 0.9986501

The probability of being <=-3 in a normal with mean 0 and stdev 1:
> pnorm(-3,0,1)
[1] 0.001349898

Note: you can get the probability of being greater than a number by computing 1 - p--.

The probability of being >12 in a normal with mean 10 and stdev 3:
> 1-pnorm(12,10,3)
[1] 0.2524925

The q-- function gives the quantile function, or what number marks a given percentile of a specific distribution: P(X <= ?) = given percentile.

The 50th percentile of a normal with mean 0 and stdev 1:
> qnorm(0.50,0,1)
[1] 0

The 10th, 25th, 75th, and 90th percentile of a normal with mean -1 and stdev 0.5:
> qnorm(c(0.10,0.25,0.75,0.90),-1,0.5)
[1] -1.6407758 -1.3372449 -0.6627551 -0.3592242

There are R functions for the following distributions: beta, binomial, Cauchy, chi-squared, exponential, F, gamma, geometric, hypergeometric, log-normal, logistic, negative binomial, normal, Poisson, Student's t, uniform, Weibull, Wilcoxon. The table below lists their R names and the necessary parameters.

Distribution         R Name     Additional Arguments
beta                 beta       shape1, shape2, ncp
binomial             binom      size, prob
Cauchy               cauchy     location, scale
chi-squared          chisq      df, ncp
exponential          exp        rate
F                    f          df1, df2, ncp
gamma                gamma      shape, scale
geometric            geom       prob
hypergeometric       hyper      m, n, k
log-normal           lnorm      meanlog, sdlog
logistic             logis      location, scale
negative binomial    nbinom     size, prob
normal               norm       mean, sd
Poisson              pois       lambda
Student's t          t          df, ncp
uniform              unif       min, max
Weibull              weibull    shape, scale
Wilcoxon             wilcox     m, n

Examples of other distributions:

Flipping a coin (Binomial distribution):

In rbinom( ), the n argument is how many times we repeat the experiment (the number of results returned), size is how many coins we flip each time, and prob (abbreviated to p in the calls below) is the probability of getting a heads.

Flip one coin once.
> rbinom(n=1,size=1,p=.5)
[1] 0

Flip three coins ten times; the results are the number of heads we saw each time.
> rbinom(n=10,size=3,p=.5)
[1] 2 1 1 2 0 3 2 1 3 3

We can change the coin to have a smaller chance of success.
> rbinom(n=10,size=3,p=.2)
[1] 2 1 0 1 1 0 0 1 1 1

What is the probability of seeing exactly 7 heads if we flip 12 coins?
> dbinom(7,12,.5)
[1] 0.1933594

Uniform distribution:

Random samples from [0,1]:
> runif(12,0,1)
[1] 0.69738202 0.15147387 0.60034879 0.70218089 0.19314468 0.19987450
[7] 0.28603845 0.08926752 0.33122900 0.28059597 0.10723647 0.38926535

or from the unit square, [0,1] by [0,1]:
> cbind(runif(8,0,1),runif(8,0,1))
          [,1]       [,2]
[1,] 0.8091289 0.02334477
[2,] 0.2980009 0.95909988
[3,] 0.9597524 0.56358745
[4,] 0.6610231 0.22847434
[5,] 0.1445462 0.82469317
[6,] 0.6264433 0.71810215
[7,] 0.9222504 0.40311884
[8,] 0.4051854 0.97956278

Poisson distribution:

What is the probability of getting exactly 4 phone calls in the next hour if on average you receive 6 phone calls an hour?
> dpois(4,6)
[1] 0.1338526

Exponential distribution:

What is the probability that a light bulb with an average lifetime of 200 hours burns out before 100 hours? (The rate is 1 over the mean lifetime.)
> pexp(100,1/200)
[1] 0.3934693

For loops:

Often we need to repeat an action several times, sometimes over subjects in a dataset.
> for(i in 1:n){
+ the action to be repeated
+ }

for indicates that we're going to loop from a start index to an end index.
i is the index we're looping over.
1 is our start index.
n is the end index.
{ opens the loop; } closes the loop.

In the example below, index<-NULL initializes the variable by assigning it the NULL value; each pass through the loop then appends i to it.

> index<-NULL
> for(i in 1:4){
+ index<-c(index,i)
+ }
> index
[1] 1 2 3 4

Looping over a dataset:

> data<-cbind(rnorm(10,0,1),rnorm(10,3,1),rnorm(10,6,1),rnorm(10,9,1))

A different way of initializing the variables:
> mean.vec<-sd.vec<-rep(0,4)

Looping over each column:
> for(i in 1:4){
+ mean.vec[i]<-mean(data[,i])
+ sd.vec[i]<-sd(data[,i])
+ }
> mean.vec
[1] 0.8071171 2.9904124 5.8173064 9.1275691
> sd.vec
[1] 1.0747551 0.9355267 0.6220746 0.7322954

You don't have to loop over a sequence. It can be any vector of numbers.

> loop.vector<-c(3,5,7,12,20)
> for(i in loop.vector){
+ cat("i=",i,"\n")
+ }
i= 3
i= 5
i= 7
i= 12
i= 20

The cat( ) function prints its arguments in order, separated by spaces. "\n" indicates a new line.

You can also loop over a selected sample of rows in your dataset. (Here data is assumed to be a dataset with at least 40 rows, not the 10-row matrix created above.)

> sample.vec<-sample(seq(1,40),10)
> sample.vec
[1] 31 28 18 1 8 20 25 11 30 29
> mean.vec<-rep(0,10)
> for(i in 1:length(sample.vec)){
+ mean.vec[i]<-mean(data[sample.vec[i],])
+ }

While Loops:

If you're not sure how many loops you need, you can use a while loop.

while(true/false statement){
  repeated action
}

R checks the true/false statement; if true, it does another loop. If false, the loop stops.

> i<-1
> while(i <= 6){
+ cat("i=",i,"\n")
+ i<-i+1
+ }
i= 1
i= 2
i= 3
i= 4
i= 5
i= 6

We need an initializing statement to start the while loop: (i<-1). If we did not have the i<-i+1 statement, i would always be 1, and the loop would be infinite.
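As a hedged illustration of the infinite-loop warning, the sketch below adds a safety cap to the same counting loop. The names max.iter and n.iter are hypothetical (they do not appear in the lecture), and the cap of 1000 is an arbitrary assumption; this is one possible safeguard, not a required pattern.

```r
# Hypothetical safeguard: stop after max.iter passes even if the
# loop condition never becomes false.
max.iter <- 1000   # assumed cap, chosen arbitrarily for illustration
n.iter <- 0        # counts how many passes have been made
i <- 1
while (i <= 6 && n.iter < max.iter) {
  cat("i=", i, "\n")
  i <- i + 1             # if this line were missing, only the cap would end the loop
  n.iter <- n.iter + 1
}
```

With the increment in place the cap is never reached; if the i <- i + 1 line were forgotten, the loop would still stop after max.iter passes instead of running forever.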
A while loop is also useful when the stopping point is random. Here we keep drawing from a normal until we get a value of at least 0. (cat(x) prints with no separator or new line, so the printed values run together.)

> x<-rnorm(1,0,1)
> while(x<0){
+ cat(x)
+ x<-rnorm(1,0,1)
+ }
-0.572205-0.09479107
> x
[1] 0.5720156

The same loop, stopping at the first value of at least 0.5:

> x<-rnorm(1,0,1)
> while(x<0.5){
+ cat(x)
+ x<-rnorm(1,0,1)
+ }
-1.043909-0.4560672-0.1760368-0.8520931-0.5805316
> x
[1] 1.865334

Recoding Variables:

Often we want to create another variable based on recoding a variable we already have, for example recoding a continuous variable as a categorical variable. We can do this with a for loop.

> age.cat<-rep(0,n)
> for(i in 1:n){
+ if(age[i]>10 & age[i]<=20) age.cat[i]<-1
+ if(age[i]>20 & age[i]<=30) age.cat[i]<-2
+ if(age[i]>30 & age[i]<=40) age.cat[i]<-3
+ # etc.
+ }

> gen.race<-NULL
> for(i in 1:n){
+ if(gender[i]=="m"&(race[i]==1|race[i]==2)) gen.race<-c(gen.race,1)
+ if(gender[i]=="f"&(race[i]==1|race[i]==2)) gen.race<-c(gen.race,2)
+ if(gender[i]=="m"&(race[i]==3|race[i]==4)) gen.race<-c(gen.race,3)
+ if(gender[i]=="f"&(race[i]==3|race[i]==4)) gen.race<-c(gen.race,4)
+ }

We can also do this with a conditional statement used as an index. (We need to initialize the space ahead of time; we can't do age.cat<-NULL.)

> age.cat<-rep(0,n)
> age.cat[age>10 & age<=20]<-1
> age.cat[age>20 & age<=30]<-2
> age.cat[age>30 & age<=40]<-3
etc.

> race.cat<-rep(0,n)
> race.cat[race==1|race==2]<-1
> race.cat[race==3|race==4]<-2
etc.
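The age recoding can also be sketched with base R's cut( ) function, which the lecture does not use; this is offered only as one possible alternative. The age values below are made up for illustration, and recoding NA to 0 is an assumption chosen to mirror the rep(0,n) initialization.

```r
# Hypothetical ages, used only for illustration
age <- c(15, 20, 21, 35, 40, 5)

# cut() assigns each value to an interval. With the default right=TRUE
# the intervals are (10,20], (20,30], (30,40], matching the conditions
# age>10 & age<=20, and so on. labels=FALSE returns the integer codes
# 1, 2, 3 instead of a factor.
age.cat <- cut(age, breaks = c(10, 20, 30, 40), labels = FALSE)

# Values that fall outside every interval come back as NA; recode them
# to 0 so the result matches the loop version's default category.
age.cat[is.na(age.cat)] <- 0
age.cat
```

The breaks vector replaces the chain of conditions, so adding another age group only requires adding one more break point.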