Christopher Knapp
University of Akron, Fall 2011
Statistical Data Management
Project #3

Problem Statement

Problem #1: Perform in SAS

Access a data file named Proj3_prob1.txt from the R:Drive. It contains 10 measures (or variables) and a classification variable on the first 35 observations. The objective is to use this data (the first 35 observations) to construct a classification function which will be able to classify new observations into group 1 or 2. The method we will use is to construct W, where:

    W = Xᵀ Sp⁻¹ (X̄₁ − X̄₂) − (1/2) (X̄₁ − X̄₂)ᵀ Sp⁻¹ (X̄₁ − X̄₂)

and where:
- X is a vector of the 10 measures on each person we are trying to classify.
- X̄₁ is a vector of means for the 10 measures of the observations in group 1.
- X̄₂ is a vector of means for the 10 measures of the observations in group 2.
- Sp is the pooled variance-covariance matrix of the 10 measures on the observations in groups 1 and 2.

The decision rule is: assign X to group 1 if W > 0, otherwise assign X to group 2. Please classify the last five observations in the file, observations 36-40. Print your results. Please email and print your SAS PROC IML code.

Problem #2: Perform in R

As you all know, there are two forms of the two-sample t-test statistic. One is used when the variances can be considered equal, and the other is an approximation used when the variances cannot be considered equal.

Example 1 (pooled):

    t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / [Sp √(1/n₁ + 1/n₂)]

    Sp² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2)

    df = n₁ + n₂ − 2

Example 2 (unpooled):

    t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(S₁²/n₁ + S₂²/n₂)

    df ≈ (S₁²/n₁ + S₂²/n₂)² / [(S₁²/n₁)²/(n₁ − 1) + (S₂²/n₂)²/(n₂ − 1)]

We would like to test how well these statistics perform in different situations. We are going to perform the hypothesis test:

    Ho: μ₁ − μ₂ = 0  vs.  Ha: μ₁ − μ₂ ≠ 0

a) Get two samples of size 18 from N(μ=30, σ²=100) and N(μ=40, σ²=100) distributions. Calculate the two test statistics and determine their p-values. Do this test 10000 times.
Determine the fraction of times the p-value is greater than α = .05 for each approach. What conclusion can you make?

b) Get one sample of size 15 from the N(μ=50, σ²=16) distribution, and another sample of size 18 from the N(μ=45, σ²=25) distribution. Calculate the two test statistics and determine their p-values. Do this test 10000 times. Determine the fraction of times the p-value is greater than α = .05 for each approach. What conclusion can you make?

Provide all necessary output and print and email your R code.

Problem #3: Perform in R

Access a data file named Proj3_prob3.txt from the R:Drive. Provide comments and all necessary output, and print your R code for each part. Also, email your R code.

a) Compute an estimate ρ̂ for the correlation coefficient ρ of X1 and X2 using the data. What formula did you use for this calculation?
b) Compute the mean of all bootstrap means and discuss. You may wish to look at some plots of the distributions of the bootstrap replicates.
c) Compute the bootstrap estimate for the bias using B=1000 replications.
d) Compute the bootstrap estimate for the standard error using B=1000 replications.
e) Compute the 95% Bootstrap t Confidence Interval for ρ using B=1000 bootstrap replicates.
f) Compute the 95% Bootstrap Percentile Confidence Interval for ρ using B=1000 bootstrap replicates.
g) Compute the jackknife estimate for the bias.
h) Compute the jackknife bias-corrected estimate for ρ.
i) Compute the Jackknife t Confidence Interval for ρ.
j) Comment on the performance of the three confidence interval methods in parts e), f), and i).

Problem #4 (Graduate Students Only): Perform in R

We often describe our emotional reaction to social rejection as "pain." A clever study suggests that the "pain" caused by social rejection really is pain, in the sense that it causes activity in brain areas known to be activated by physical pain.
Here are data for 13 subjects on degree of social distress and extent of brain activity. Make a scatterplot of brain activity against social distress. There is a positive linear association, with correlation r = 0.878. Is this correlation significantly greater than 0? Use a permutation test with 1000 permuted samples. Construct a histogram of all the permuted correlation statistics. Provide all necessary output and print and email your R code.

Contents

Problem 1
  Output
  Code
Problem 2
  Output
  Code
Problem 3
  Data Import
  Part A: Estimate for ρ
  Parts B, C, D: The Bootstrap Distribution
  Part E: 95% Bootstrap t Confidence Interval
  Part F: 95% Bootstrap Percentile Confidence Interval
  Parts G, H, I: Jackknife
  Part J: Confidence Interval Discussion
Problem 4
  Scatterplot
  Permutation Distribution

Problem 1

Output

The second column is the value of W and the third column is the classification.
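Before the SAS listing, the classification rule itself is easy to sanity-check. The following is a minimal R sketch on made-up data (two measures instead of ten; none of these numbers are the project data):

```r
# Toy sanity check of the classification statistic W (illustrative data only).
set.seed(1)
g1 <- matrix(rnorm(20, mean = 5), ncol = 2)   # 10 observations in group 1
g2 <- matrix(rnorm(20, mean = 7), ncol = 2)   # 10 observations in group 2
n1 <- nrow(g1); n2 <- nrow(g2)
xbar1 <- colMeans(g1)
xbar2 <- colMeans(g2)
# Pooled variance-covariance matrix
Sp <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)

# W = x' Sp^-1 (xbar1 - xbar2) - (1/2)(xbar1 - xbar2)' Sp^-1 (xbar1 - xbar2)
Wstat <- function(x) {
  d <- xbar1 - xbar2
  drop(t(x) %*% solve(Sp) %*% d) - 0.5 * drop(t(d) %*% solve(Sp) %*% d)
}

# Decision rule: assign to group 1 if W > 0, otherwise group 2.
newobs <- c(5.2, 4.8)
group  <- if (Wstat(newobs) > 0) 1 else 2
```

Since Sp is positive definite, W evaluated at the group 1 mean always exceeds W evaluated at the group 2 mean, which is what makes the W > 0 cut usable as a classifier.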
Code

/* DATA IMPORT */
DATA prob1_X;
INPUT var1 - var10 classification;
DATALINES;
2 1 3.55 410 0 17 43 61 129 3 2
2 1 2.7 390 0 20 50 47 60 1 1
2 1 3.5 510 0 22 47 79 119 1 1
3 1 2.91 430 0 13 24 40 100 1 2
2 1 3.1 600 0 16 47 60 79 2 1
3 1 3.49 610 0 28 57 59 99 1 1
1 0 3.17 610 0 14 42 61 92 3 1
2 1 3.57 560 0 10 42 79 107 2 2
3 1 3.76 700 1 28 69 83 156 1 1
2 0 3.81 460 1 30 48 67 110 1 2
2 0 3.6 590 1 28 59 74 116 1 1
3 0 3.1 500 1 15 21 40 49 1 1
1 1 3.08 410 0 24 52 71 107 5 2
2 1 3.5 470 1 15 35 40 125 1 1
2 1 3.43 210 1 26 35 57 64 5 2
2 0 3.39 610 0 16 59 58 100 1 1
2 0 3.76 510 1 25 68 66 138 2 1
3 0 3.71 600 0 3 38 58 63 1 1
2 1 3.00 470 1 5 45 24 82 3 1
2 0 3.47 460 0 16 37 48 73 3 1
2 1 3.69 800 1 28 54 100 132 2 1
1 1 3.24 610 0 13 45 83 87 2 1
2 1 3.46 490 0 9 31 70 89 2 2
2 0 3.39 470 0 13 39 48 99 1 2
2 0 3.9 610 1 30 67 85 119 2 1
1 0 2.76 580 0 10 30 14 100 1 2
2 1 2.7 410 0 13 19 55 84 2 2
1 1 3.77 630 1 8 71 100 166 3 1
2 1 4.00 790 1 29 80 94 111 2 1
3 1 3.4 490 0 17 47 45 110 1 2
2 0 3.09 400 0 15 46 58 93 1 1
2 1 3.8 610 1 16 59 90 141 2 1
1 1 3.28 610 1 13 48 84 99 2 2
1 1 3.7 500 1 30 68 81 114 5 1
2 1 3.42 430 1 17 43 49 96 1 1
3 1 3.09 540 0 17 31 54 39 1 0
1 1 3.7 610 0 25 64 87 149 4 0
2 1 2.69 400 0 10 19 36 53 3 0
3 1 3.4 390 0 23 43 51 39 1 0
1 0 2.95 490 0 18 20 59 91 1 0
;
RUN;

DATA prob1_c1;
SET prob1_X;
WHERE classification=1;
RUN;

DATA prob1_c2;
SET prob1_X;
WHERE classification=2;
RUN;

DATA prob1_c0;
SET prob1_X;
WHERE classification=0;
RUN;

PROC IML;
/* READ DATA INTO X, C1, C2, C0 */
USE prob1_X;
READ ALL INTO X;
USE prob1_c1;
READ ALL VAR { var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 } INTO C1;
USE prob1_X;
READ ALL VAR { var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 } INTO Xnoclass;
USE prob1_c2;
READ ALL VAR { var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 } INTO C2;
USE prob1_c0;
READ ALL INTO C0;

/* GET XBAR1 AND XBAR2 */
xbar1 = t(C1[+,]/23);
xbar2 = t(C2[+,]/12);

/* COMPUTE CORRELATION MATRIX corr_c1 */
sum = C1[+,];
n1 = nrow(C1);
step1 = t(C1)*C1 - t(sum)*sum/n1;
step2 = diag(1/sqrt(vecdiag(step1)));
corr_c1 = step2*step1*step2;

/* COMPUTE STDEV1 */
temp1 = C1;
DO col=1 TO 10;
  DO row=1 TO n1;
    temp1[row,col] = (temp1[row,col]-xbar1[col])**2;
  END;
END;
stdev1 = sqrt(t(temp1[+,]/(n1-1)));

/* COMPUTE CORRELATION MATRIX corr_c2 */
sum = C2[+,];
n2 = nrow(C2);
step1 = t(C2)*C2 - t(sum)*sum/n2;
step2 = diag(1/sqrt(vecdiag(step1)));
corr_c2 = step2*step1*step2;

/* COMPUTE STDEV2 */
temp2 = C2;
DO col=1 TO 10;
  DO row=1 TO n2;
    temp2[row,col] = (temp2[row,col]-xbar2[col])**2;
  END;
END;
stdev2 = sqrt(t(temp2[+,]/(n2-1)));

/* COMPUTE VAR/COV MATRIX S1 */
S1 = j(10,10,0);
DO i=1 TO 10;
  DO j=1 TO 10;
    S1[i,j] = corr_c1[i,j]*stdev1[i]*stdev1[j];
  END;
END;

/* COMPUTE VAR/COV MATRIX S2 */
S2 = j(10,10,0);
DO i=1 TO 10;
  DO j=1 TO 10;
    S2[i,j] = corr_c2[i,j]*stdev2[i]*stdev2[j];
  END;
END;

/* COMPUTE W */
W = j(40,1,0);
/* http://www.palass.org/modules.php?name=palaeo_math&page=17 */
Sp = ((n1-1)*S1+(n2-1)*S2)/(n1+n2-2);
DO obs=1 TO 40;
  W[obs] = Xnoclass[obs,]*inv(Sp)*(xbar1-xbar2);
  W[obs] = W[obs] - (.5*t(xbar1-xbar2)*inv(Sp)*(xbar1-xbar2));
END;

/* COMPUTE OUTPUT */
final = X[,11] || ((W<0)+1) || (X[,11]=((W<0)+1));
DO obs=36 TO 40;
  output = obs || W[obs] || ((W[obs]<0)+1);
  PRINT output;
END;
RUN;

Problem 2

Output

Part a:

Part b:

Code

####### Problem 2a #######

###############################################################################
# Parameters
###############################################################################
n1   <- 18
mu1  <- 30
sig1 <- 10
n2   <- 18
mu2  <- 40
sig2 <- 10
sims <- 10000

###############################################################################
# Initialize vectors
###############################################################################
xbar1 <- numeric(sims)
xbar2 <- numeric(sims)
s1    <- numeric(sims)
s2    <- numeric(sims)
teststatpooled   <- numeric(sims)
teststatunpooled <- numeric(sims)
pvaluepooled     <- numeric(sims)
pvalueunpooled   <- numeric(sims)
###############################################################################
# Program Body
###############################################################################
xvalues1 <- matrix(rnorm(sims*n1, mu1, sig1), nrow=sims, ncol=n1)
xvalues2 <- matrix(rnorm(sims*n2, mu2, sig2), nrow=sims, ncol=n2)

xbar1 <- apply(xvalues1, 1, mean)
xbar2 <- apply(xvalues2, 1, mean)

for (i in 1:sims) {
  s1[i] <- sd(xvalues1[i,])
  s2[i] <- sd(xvalues2[i,])
}

pooledtotal   <- 0   # will count the number of p-values > .05
unpooledtotal <- 0

for (i in 1:sims) {
  spooled  <- sqrt(((n1-1)*s1[i]^2 + (n2-1)*s2[i]^2)/(n1+n2-2))
  dfpooled <- n1+n2-2

  dfunpooled_n  <- (s1[i]^2/n1 + s2[i]^2/n2)^2
  dfunpooled_d1 <- (s1[i]^2/n1)^2/(n1-1)
  dfunpooled_d2 <- (s2[i]^2/n2)^2/(n2-1)
  dfunpooled    <- dfunpooled_n/(dfunpooled_d1 + dfunpooled_d2)

  teststatpooled[i]   <- (xbar1[i]-xbar2[i])/(spooled*sqrt((1/n1)+(1/n2)))
  teststatunpooled[i] <- (xbar1[i]-xbar2[i])/sqrt(s1[i]^2/n1 + s2[i]^2/n2)

  pvaluepooled[i]   <- 2*(1-pt(abs(teststatpooled[i]), dfpooled))
  pvalueunpooled[i] <- 2*(1-pt(abs(teststatunpooled[i]), dfunpooled))

  if (pvaluepooled[i]   > .05) pooledtotal   <- pooledtotal + 1
  if (pvalueunpooled[i] > .05) unpooledtotal <- unpooledtotal + 1
}

###############################################################################
# Program Conclusion
###############################################################################
pooledtotal/10000
unpooledtotal/10000

####### Problem 2b #######

###############################################################################
# Parameters
###############################################################################
n1   <- 15
mu1  <- 50
sig1 <- 4
n2   <- 18
mu2  <- 45
sig2 <- 5
sims <- 10000

###############################################################################
# Initialize vectors
###############################################################################
xbar1 <- numeric(sims)
xbar2 <- numeric(sims)
s1    <- numeric(sims)
s2    <- numeric(sims)
teststatpooled   <- numeric(sims)
teststatunpooled <- numeric(sims)
pvaluepooled     <- numeric(sims)
pvalueunpooled   <- numeric(sims)

###############################################################################
# Program Body
###############################################################################
xvalues1 <- matrix(rnorm(sims*n1, mu1, sig1), nrow=sims, ncol=n1)
xvalues2 <- matrix(rnorm(sims*n2, mu2, sig2), nrow=sims, ncol=n2)

xbar1 <- apply(xvalues1, 1, mean)
xbar2 <- apply(xvalues2, 1, mean)

for (i in 1:sims) {
  s1[i] <- sd(xvalues1[i,])
  s2[i] <- sd(xvalues2[i,])
}

pooledtotal   <- 0   # will count the number of p-values > .05
unpooledtotal <- 0

for (i in 1:sims) {
  spooled  <- sqrt(((n1-1)*s1[i]^2 + (n2-1)*s2[i]^2)/(n1+n2-2))
  dfpooled <- n1+n2-2

  dfunpooled_n  <- (s1[i]^2/n1 + s2[i]^2/n2)^2
  dfunpooled_d1 <- (s1[i]^2/n1)^2/(n1-1)
  dfunpooled_d2 <- (s2[i]^2/n2)^2/(n2-1)
  dfunpooled    <- dfunpooled_n/(dfunpooled_d1 + dfunpooled_d2)

  teststatpooled[i]   <- (xbar1[i]-xbar2[i])/(spooled*sqrt((1/n1)+(1/n2)))
  teststatunpooled[i] <- (xbar1[i]-xbar2[i])/sqrt(s1[i]^2/n1 + s2[i]^2/n2)

  pvaluepooled[i]   <- 2*(1-pt(abs(teststatpooled[i]), dfpooled))
  pvalueunpooled[i] <- 2*(1-pt(abs(teststatunpooled[i]), dfunpooled))

  if (pvaluepooled[i]   > .05) pooledtotal   <- pooledtotal + 1
  if (pvalueunpooled[i] > .05) unpooledtotal <- unpooledtotal + 1
}

###############################################################################
# Program Conclusion
###############################################################################
pooledtotal/10000
unpooledtotal/10000

Problem 3

Data Import

setwd("R:/Fridline/Statistical Data Management/Project #3")
getwd()
bootdata <- read.table("Proj3_prob3.txt", header=T, sep=" ")

PART A: Estimate for ρ

The estimated correlation, ρ̂ = .29932, is displayed in the output below.
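The Part A calculation that follows uses the computational formula for Pearson's r, r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]. As a quick sanity check (on random stand-in data, since the project file lives on the R:Drive), the formula agrees with base R's cor():

```r
# Check that the computational formula for Pearson's r matches cor().
# Random data stands in for bootdata, which is not available here.
set.seed(7)
x <- rnorm(400)
y <- 0.3 * x + rnorm(400)
n <- length(x)
r_manual <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
all.equal(r_manual, cor(x, y))   # TRUE
```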
x1bar = mean(bootdata[,1])
x2bar = mean(bootdata[,2])
sumofx  = 0
sumofy  = 0
sumofx2 = 0
sumofy2 = 0
sumofxy = 0
for (i in 1:400) {
  sumofx  = sumofx  + bootdata[i,1]
  sumofy  = sumofy  + bootdata[i,2]
  sumofx2 = sumofx2 + bootdata[i,1]^2
  sumofy2 = sumofy2 + bootdata[i,2]^2
  sumofxy = sumofxy + bootdata[i,1]*bootdata[i,2]
}
r_numerator    = (400*sumofxy) - (sumofx*sumofy)
r_denominator1 = (400*sumofx2) - (sumofx^2)
r_denominator2 = (400*sumofy2) - (sumofy^2)
r_denominator  = sqrt(r_denominator1*r_denominator2)
r = r_numerator/r_denominator
r
originalr <- r   # saved for the bootstrap and jackknife parts

PART B, C, D: The Bootstrap Distribution

The histogram displays the bootstrap distribution. Note that it appears approximately normal and symmetric. The output below displays the mean (.29854), bias (-.00078), and standard error (.05043) of the bootstrap distribution.

sim = 1000
simulatedvalues <- numeric(sim)
n <- 400
for (i in 1:sim) {
  index <- sample(1:400, 400, r=T)   # resample rows with replacement
  X1 <- bootdata[index,1]
  Y1 <- bootdata[index,2]
  Xcume   <- sum(X1)
  Ycume   <- sum(Y1)
  Xsqcume <- sum(X1^2)
  Ysqcume <- sum(Y1^2)
  XYcume  <- sum(X1*Y1)
  r <- (XYcume - Xcume*Ycume/n) /
       (sqrt(Xsqcume - Xcume^2/n) * sqrt(Ysqcume - Ycume^2/n))
  simulatedvalues[i] <- r
}
mean(simulatedvalues)
bias = mean(simulatedvalues) - .2993151
bias
sd(simulatedvalues)
hist(simulatedvalues)

PART E: 95% Bootstrap t Confidence Interval

The 95% bootstrap t confidence interval is (.20017, .39846). Due to the low bias and the approximate normality of the bootstrap distribution, this is a good estimate of the 95% confidence interval of the sampling distribution.

# 95% confidence interval
UB = originalr + (qt(.975,399)*sd(simulatedvalues))
LB = originalr - (qt(.975,399)*sd(simulatedvalues))
UB
LB

PART F: 95% Bootstrap Percentile Confidence Interval

The 95% bootstrap percentile confidence interval is (.19275, .38115).
Notice this is fairly close to the 95% bootstrap t confidence interval.

# 95% percentile interval
sortedvalues = sort(simulatedvalues)
LBpercentile = sortedvalues[25]    # 2.5th percentile of the 1000 replicates
UBpercentile = sortedvalues[975]   # 97.5th percentile of the 1000 replicates
LBpercentile
UBpercentile

PART G, H, I: Jackknife

x1 = bootdata[,1]
y1 = bootdata[,2]
n = 400
jack <- rep(0,n)
for (j in 1:n) {
  sampx <- x1[-j]
  sampy <- y1[-j]
  jack[j] = cor(sampx, sampy)   # leave-one-out correlation
}
mean(jack)
originalr
meanjack = mean(jack)
biasjack = 399*(meanjack - originalr)
biasjack   #g
biasedcorrect = originalr - biasjack
biasedcorrect   #h
se = sqrt((399/400)*sum((jack-meanjack)^2))
LB = biasedcorrect - (qt(.975,399)*se)
UB = biasedcorrect + (qt(.975,399)*se)
LB   #i
UB   #i

PART J: Confidence Interval Discussion

The three confidence intervals were very close to each other, the bootstrap distribution is approximately normal, and the bias is small, so the confidence intervals are fairly accurate.

Problem 4

Scatterplot

data = c(1, 1.26, -.055, 2, 1.85, -.04, 3, 1.1, -.026, 4, 2.5, -.017, 5, 2.17, -.017,
         6, 2.67, .017, 7, 2.01, .021, 8, 2.18, .025, 9, 2.58, .027, 10, 2.75, .033,
         11, 2.75, .064, 12, 3.33, .077, 13, 3.65, .124)
dataset = t(matrix(data, ncol=13))   # 13 x 3: subject, distress, activity
plot(dataset[,2], dataset[,3], xlab="Social Distress", ylab="Brain Activity")
cor(dataset[,2], dataset[,3])

Permutation Distribution

Since the p-value is 0, at any significance level there is enough evidence to conclude that the correlation is larger than 0.
data = c(1, 1.26, -.055, 2, 1.85, -.04, 3, 1.1, -.026, 4, 2.5, -.017, 5, 2.17, -.017,
         6, 2.67, .017, 7, 2.01, .021, 8, 2.18, .025, 9, 2.58, .027, 10, 2.75, .033,
         11, 2.75, .064, 12, 3.33, .077, 13, 3.65, .124)
dataset = t(matrix(data, ncol=13))
plot(dataset[,2], dataset[,3], xlab="Social Distress", ylab="Brain Activity")
teststat = cor(dataset[,2], dataset[,3])

corvals = numeric(1000)
for (i in 1:1000) {
  index <- sample(1:13, 13, r=F)   # a random permutation of the 13 subjects
  newactivity <- dataset[index,3]
  corvals[i] = cor(dataset[,2], newactivity)
}
hist(corvals)
mean(corvals)
sd(corvals)

pvalue = 0
for (i in 1:1000) {
  pvalue = pvalue + (teststat < corvals[i])   # count permuted r's exceeding the observed r
}
pvalue = pvalue/1000
pvalue
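The permutation loop above can be cross-checked with a more compact version, where replicate() plays the role of the for loop. This is a sketch for verification only, not part of the submitted code:

```r
# Compact cross-check of the Problem 4 permutation test.
set.seed(42)
distress <- c(1.26, 1.85, 1.10, 2.50, 2.17, 2.67, 2.01, 2.18, 2.58, 2.75,
              2.75, 3.33, 3.65)
activity <- c(-.055, -.040, -.026, -.017, -.017, .017, .021, .025, .027,
              .033, .064, .077, .124)
obs  <- cor(distress, activity)               # about 0.878, as stated above
# sample(activity) shuffles the activity values, breaking any association
perm <- replicate(1000, cor(distress, sample(activity)))
pval <- mean(perm >= obs)                     # one-sided: Ha is rho > 0
```

mean(perm >= obs) is exactly the fraction counted by the loop in the listing, so both versions should give a p-value at or near 0.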