Christopher Knapp
University of Akron, Fall 2011
Statistical Data Management
Project #3
Problem Statement
Problem #1: Perform in SAS
Access a data file named Proj3_prob1.txt from the R: drive. It contains 10 measures (variables) and, for the first 35 observations, a
classification variable. The objective is to use these first 35 observations to construct a classification function that will
be able to classify new observations into group 1 or group 2. The method we will use is to construct W, where:
W = Xᵀ Sp⁻¹ (X̄1 − X̄2) − ½ (X̄1 − X̄2)ᵀ Sp⁻¹ (X̄1 − X̄2)
Where X is a vector of the 10 measures on each person we are trying to classify.
𝑋̅1 is a vector of means for the 10 measures of the observations in group 1.
𝑋̅2 is a vector of means for the 10 measures of the observations in group 2.
Sp is the pooled variance-covariance matrix of the 10 measures on the observations in groups 1 and 2.
The decision rule is: Assign X to group 1 if W > 0, otherwise assign X to group 2. Please classify the last five observations in the
file, observations 36-40. Print your results.
Please email and print your SAS PROC IML code.
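The decision rule above can be sketched in a few lines of R. The data below are made-up two-variable stand-ins purely for illustration (the real problem uses the 10 measures in Proj3_prob1.txt), and `classify` is a hypothetical helper name:

```r
# Illustrative sketch of the W classification rule on made-up 2-variable data
set.seed(1)
g1 <- matrix(rnorm(20, mean = 5), ncol = 2)  # 10 observations in group 1
g2 <- matrix(rnorm(20, mean = 3), ncol = 2)  # 10 observations in group 2
xbar1 <- colMeans(g1)
xbar2 <- colMeans(g2)
n1 <- nrow(g1); n2 <- nrow(g2)
# Pooled variance-covariance matrix
Sp <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)
# W > 0 assigns x to group 1, otherwise group 2
classify <- function(x) {
  W <- t(x) %*% solve(Sp) %*% (xbar1 - xbar2) -
    0.5 * t(xbar1 - xbar2) %*% solve(Sp) %*% (xbar1 - xbar2)
  if (W > 0) 1 else 2
}
classify(c(5, 5))  # a point near group 1's mean
classify(c(3, 3))  # a point near group 2's mean
```

The same algebra carries over unchanged to the 10-variable case; only the dimensions of X, the mean vectors, and Sp grow.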
Problem #2: Perform in R
As you all know, there are two forms of the two-sample t-test statistic: one is used when the variances can be considered equal, and the
other is an approximation for when the variances cannot be considered equal.
Example 1:
t = [ (x̄1 − x̄2) − (μ1 − μ2) ] / [ Sp √(1/n1 + 1/n2) ]
Sp² = [ (n1 − 1)S1² + (n2 − 1)S2² ] / (n1 + n2 − 2)
df = n1 + n2 − 2
Example 2:
t = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( S1²/n1 + S2²/n2 )
df ≈ ( S1²/n1 + S2²/n2 )² / [ (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ]
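These two formulas can be sanity-checked against R's built-in t.test(), which computes the pooled statistic with var.equal=TRUE and the Welch approximation by default. The samples below are illustrative, not part of the assignment:

```r
# Verify the pooled and Welch t statistics (and Welch df) against t.test()
set.seed(42)
x <- rnorm(18, 30, 10)  # sample 1
y <- rnorm(18, 40, 10)  # sample 2
n1 <- length(x); n2 <- length(y)
s1 <- sd(x); s2 <- sd(y)

# Pooled form (equal variances)
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
t_pooled <- (mean(x) - mean(y)) / (sqrt(sp2) * sqrt(1/n1 + 1/n2))

# Welch form (unequal variances), with the approximate df
t_welch <- (mean(x) - mean(y)) / sqrt(s1^2/n1 + s2^2/n2)
df_welch <- (s1^2/n1 + s2^2/n2)^2 /
  ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))

all.equal(t_pooled, unname(t.test(x, y, var.equal = TRUE)$statistic))
all.equal(t_welch,  unname(t.test(x, y)$statistic))
all.equal(df_welch, unname(t.test(x, y)$parameter))
```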
We would like to test how well these statistics perform in different situations. We are going to perform the hypothesis test Ho: µ1 − µ2 = 0
vs. Ha: µ1 − µ2 ≠ 0.
a) Get two samples of size 18 from N(µ=30, σ²=100) and N(µ=40, σ²=100) distributions. Calculate the two test statistics and
determine their p-values. Do this test 10000 times. Determine the fraction of times the p-value is greater than α = .05 for each
approach. What conclusion can you make?
b) Get one sample of size 15 from the N(µ=50, σ²=16) distribution, and another sample of size 18 from the N(µ=45, σ²=25)
distribution. Calculate the two test statistics and determine their p-values. Do this test 10000 times. Determine the fraction of
times the p-value is greater than α = .05 for each approach. What conclusion can you make?
Provide all necessary output and print and email your R code.
Problem #3: Perform in R
Access a data file named Proj3_prob3.txt from the R: drive. Provide comments and all necessary output, and print
your R code for each part. Also, email your R code.
a) Compute an estimate ρ̂ for the correlation coefficient ρ of X1 and X2 using the data. What was the formula
you used for this calculation?
b) Compute the mean of all bootstrap means and discuss. You may wish to look at some plots of the
distributions of the bootstrap replicates.
c) Compute the bootstrap estimate for the bias using B=1000 replications.
d) Compute the bootstrap estimate for the standard error using B=1000 replications.
e) Compute the 95% Bootstrap t Confidence Interval for ρ using B=1000 bootstrap replicates.
f) Compute the 95% Bootstrap Percentile Confidence Interval for ρ using B=1000 bootstrap replicates.
g) Compute the jackknife estimate for the bias.
h) Compute the jackknife bias-corrected estimate for ρ.
i) Compute the Jackknife t Confidence Interval for ρ.
j) Comment on the performance of the three confidence interval methods in parts e), f), and i).
Problem #4 Graduate Students Only: Perform in R
We often describe our emotional reaction to social rejection as “pain.” A clever study suggests that the “pain”
caused by social rejection really is pain, in the sense that it causes activity in brain areas known to be activated by
physical pain. Here are data for 13 subjects on degree of social distress and extent of brain activity:

Subject:          1      2      3      4      5      6     7     8     9     10    11    12    13
Social distress:  1.26   1.85   1.10   2.50   2.17   2.67  2.01  2.18  2.58  2.75  2.75  3.33  3.65
Brain activity:  -.055  -.040  -.026  -.017  -.017  .017  .021  .025  .027  .033  .064  .077  .124
Make a scatterplot of brain activity against social distress. There is a positive linear association, with correlation r =
0.878. Is this correlation significantly greater than 0? Use a permutation test using 1000 permuted samples. Construct a
histogram of all the permuted correlation statistics. Provide all necessary output and print and email your R code.
Contents
Problem 1
  Output
  Code
Problem 2
  Output
  Code
Problem 3
  Data Import
  Part A: Estimate for ρ
  Part B, C, D: The Bootstrap Distribution
  Part E: 95% Bootstrap t Confidence Interval
  Part F: 95% Bootstrap Percentile Confidence Interval
  Part G, H, I: Jackknife
  Part J: Confidence Interval Discussion
Problem 4
  Scatterplot
  Permutation Distribution
Problem 1
Output
For each of observations 36-40, the first column is the observation number, the second column is the value of W, and the third column is the assigned group.
Code
/* DATA IMPORT */
DATA prob1_X;
INPUT var1 - var10 classification;
DATALINES;
2 1 3.55 410 0 17 43  61 129 3 2
2 1 2.70 390 0 20 50  47  60 1 1
2 1 3.50 510 0 22 47  79 119 1 1
3 1 2.91 430 0 13 24  40 100 1 2
2 1 3.10 600 0 16 47  60  79 2 1
3 1 3.49 610 0 28 57  59  99 1 1
1 0 3.17 610 0 14 42  61  92 3 1
2 1 3.57 560 0 10 42  79 107 2 2
3 1 3.76 700 1 28 69  83 156 1 1
2 0 3.81 460 1 30 48  67 110 1 2
2 0 3.60 590 1 28 59  74 116 1 1
3 0 3.10 500 1 15 21  40  49 1 1
1 1 3.08 410 0 24 52  71 107 5 2
2 1 3.50 470 1 15 35  40 125 1 1
2 1 3.43 210 1 26 35  57  64 5 2
2 0 3.39 610 0 16 59  58 100 1 1
2 0 3.76 510 1 25 68  66 138 2 1
3 0 3.71 600 0  3 38  58  63 1 1
2 1 3.00 470 1  5 45  24  82 3 1
2 0 3.47 460 0 16 37  48  73 3 1
2 1 3.69 800 1 28 54 100 132 2 1
1 1 3.24 610 0 13 45  83  87 2 1
2 1 3.46 490 0  9 31  70  89 2 2
2 0 3.39 470 0 13 39  48  99 1 2
2 0 3.90 610 1 30 67  85 119 2 1
1 0 2.76 580 0 10 30  14 100 1 2
2 1 2.70 410 0 13 19  55  84 2 2
1 1 3.77 630 1  8 71 100 166 3 1
2 1 4.00 790 1 29 80  94 111 2 1
3 1 3.40 490 0 17 47  45 110 1 2
2 0 3.09 400 0 15 46  58  93 1 1
2 1 3.80 610 1 16 59  90 141 2 1
1 1 3.28 610 1 13 48  84  99 2 2
1 1 3.70 500 1 30 68  81 114 5 1
2 1 3.42 430 1 17 43  49  96 1 1
3 1 3.09 540 0 17 31  54  39 1 0
1 1 3.70 610 0 25 64  87 149 4 0
2 1 2.69 400 0 10 19  36  53 3 0
3 1 3.40 390 0 23 43  51  39 1 0
1 0 2.95 490 0 18 20  59  91 1 0
;
RUN;
DATA prob1_c1;
SET prob1_X;
WHERE classification=1;
RUN;
DATA prob1_c2;
SET prob1_X;
WHERE classification=2;
RUN;
DATA prob1_c0;
SET prob1_X;
WHERE classification=0;
RUN;
PROC IML;
/* READ DATA INTO X, C1, C2, C0*/
USE Prob1_X;
READ ALL INTO X;
USE Prob1_c1;
READ ALL VAR
{
var1 var2 var3 var4 var5
var6 var7 var8 var9 var10
}
INTO C1;
USE Prob1_x;
READ ALL VAR
{
var1 var2 var3 var4 var5
var6 var7 var8 var9 var10
}
INTO Xnoclass;
USE Prob1_c2;
READ ALL VAR
{
var1 var2 var3 var4 var5
var6 var7 var8 var9 var10
}
INTO C2;
USE Prob1_c0;
READ ALL INTO C0;
/* GET XBAR1 AND XBAR2 */
xbar1 = t(C1[+,]/23);
xbar2 = t(C2[+,]/12);
/* COMPUTE CORRELATION MATRIX corr_c1 */
sum=c1[+,];
n1=nrow(c1);
step1=t(c1)*c1-t(sum)*sum/n1;
step2=diag(1/sqrt(vecdiag(step1)));
corr_c1=step2*step1*step2;
/* COMPUTE STDEV1 */
temp1=C1;
DO col=1 TO 10;
DO row=1 TO n1;
temp1[row,col]=(temp1[row,col]-xbar1[col])**2;
END;
END;
stdev1 = sqrt(t(temp1[+,]/(n1-1)));
/* COMPUTE CORRELATION MATRIX corr_c2 */
sum=c2[+,];
n2=nrow(c2);
step1=t(c2)*c2-t(sum)*sum/n2;
step2=diag(1/sqrt(vecdiag(step1)));
corr_c2=step2*step1*step2;
/* COMPUTE STDEV2 */
temp2=C2;
DO col=1 TO 10;
DO row=1 TO n2;
temp2[row,col]=(temp2[row,col]-xbar2[col])**2;
END;
END;
stdev2 = sqrt(t(temp2[+,]/(n2-1)));
/* COMPUTE VAR/COV MATRIX S1 */
S1 = j(10,10,0);
DO i=1 TO 10;
DO j=1 TO 10;
S1[i,j]=corr_c1[i,j]*stdev1[i]*stdev1[j];
END;
END;
/* COMPUTE VAR/COV MATRIX S2 */
S2 = j(10,10,0);
DO i=1 TO 10;
DO j=1 TO 10;
S2[i,j]=corr_c2[i,j]*stdev2[i]*stdev2[j];
END;
END;
/* COMPUTE W */
W=j(40,1,0);
/* http://www.palass.org/modules.php?name=palaeo_math&page=17 */
Sp = ((n1-1)*S1+(n2-1)*S2)/(n1+n2-2);
DO obs=1 TO 40;
W[obs] = (xnoclass[obs,]*inv(Sp)*(xbar1-xbar2));
W[obs] = W[obs] - (.5*t(xbar1-xbar2)*inv(Sp)*(xbar1-xbar2));
END;
/* COMPUTE OUTPUT */
final = X[,11] || ((W<0)+1) || (X[,11]=((W<0)+1));
DO obs=36 TO 40;
output = obs || w[obs] || ((W[obs]<0)+1);
print output;
END;
RUN;
Problem 2
Output
Part a:
Part b:
Code
####### Problem 2a #######
###############################################################################
#                               Parameters                                    #
###############################################################################
n1   <- 18
mu1  <- 30
sig1 <- 10
n2   <- 18
mu2  <- 40
sig2 <- 10
sims <- 10000
###############################################################################
#                          Initialize Vectors                                 #
###############################################################################
s1             <- numeric(sims)
s2             <- numeric(sims)
pvaluepooled   <- numeric(sims)
pvalueunpooled <- numeric(sims)
###############################################################################
#                             Program Body                                    #
###############################################################################
xvalues1 <- matrix(rnorm(sims*n1, mu1, sig1), nrow=sims, ncol=n1)
xvalues2 <- matrix(rnorm(sims*n2, mu2, sig2), nrow=sims, ncol=n2)
xbar1 <- apply(xvalues1, 1, mean)
xbar2 <- apply(xvalues2, 1, mean)
for (i in 1:sims)
{
  s1[i] <- sd(xvalues1[i,])
  s2[i] <- sd(xvalues2[i,])
}
pooledtotal   <- 0   # will count number of p-values > .05
unpooledtotal <- 0
for (i in 1:sims)
{
  spooled  <- sqrt(((n1-1)*s1[i]*s1[i] + (n2-1)*s2[i]*s2[i])/(n1+n2-2))
  dfpooled <- n1+n2-2
  dfunpooled_n  <- ((s1[i]*s1[i]/n1) + (s2[i]*s2[i]/n2))**2
  dfunpooled_d1 <- ((s1[i]*s1[i]/n1)**2)/(n1-1)
  dfunpooled_d2 <- ((s2[i]*s2[i]/n2)**2)/(n2-1)
  dfunpooled    <- dfunpooled_n/(dfunpooled_d1 + dfunpooled_d2)
  teststatpooled   <- (xbar1[i]-xbar2[i])/(spooled*sqrt((1/n1)+(1/n2)))
  teststatunpooled <- (xbar1[i]-xbar2[i])/sqrt((s1[i]*s1[i]/n1)+(s2[i]*s2[i]/n2))
  pvaluepooled[i]   <- 2*(1-pt(abs(teststatpooled), dfpooled))
  pvalueunpooled[i] <- 2*(1-pt(abs(teststatunpooled), dfunpooled))
  if (pvaluepooled[i] > .05)   pooledtotal   <- pooledtotal + 1
  if (pvalueunpooled[i] > .05) unpooledtotal <- unpooledtotal + 1
}
###############################################################################
#              Program Conclusion: fraction of p-values > .05                 #
###############################################################################
pooledtotal/10000
unpooledtotal/10000
###############################################################################
####### Problem 2b #######
###############################################################################
#                               Parameters                                    #
###############################################################################
n1   <- 15
mu1  <- 50
sig1 <- 4
n2   <- 18
mu2  <- 45
sig2 <- 5
sims <- 10000
###############################################################################
#                          Initialize Vectors                                 #
###############################################################################
s1             <- numeric(sims)
s2             <- numeric(sims)
pvaluepooled   <- numeric(sims)
pvalueunpooled <- numeric(sims)
###############################################################################
#                             Program Body                                    #
###############################################################################
xvalues1 <- matrix(rnorm(sims*n1, mu1, sig1), nrow=sims, ncol=n1)
xvalues2 <- matrix(rnorm(sims*n2, mu2, sig2), nrow=sims, ncol=n2)
xbar1 <- apply(xvalues1, 1, mean)
xbar2 <- apply(xvalues2, 1, mean)
for (i in 1:sims)
{
  s1[i] <- sd(xvalues1[i,])
  s2[i] <- sd(xvalues2[i,])
}
pooledtotal   <- 0   # will count number of p-values > .05
unpooledtotal <- 0
for (i in 1:sims)
{
  spooled  <- sqrt(((n1-1)*s1[i]*s1[i] + (n2-1)*s2[i]*s2[i])/(n1+n2-2))
  dfpooled <- n1+n2-2
  dfunpooled_n  <- ((s1[i]*s1[i]/n1) + (s2[i]*s2[i]/n2))**2
  dfunpooled_d1 <- ((s1[i]*s1[i]/n1)**2)/(n1-1)
  dfunpooled_d2 <- ((s2[i]*s2[i]/n2)**2)/(n2-1)
  dfunpooled    <- dfunpooled_n/(dfunpooled_d1 + dfunpooled_d2)
  teststatpooled   <- (xbar1[i]-xbar2[i])/(spooled*sqrt((1/n1)+(1/n2)))
  teststatunpooled <- (xbar1[i]-xbar2[i])/sqrt((s1[i]*s1[i]/n1)+(s2[i]*s2[i]/n2))
  pvaluepooled[i]   <- 2*(1-pt(abs(teststatpooled), dfpooled))
  pvalueunpooled[i] <- 2*(1-pt(abs(teststatunpooled), dfunpooled))
  if (pvaluepooled[i] > .05)   pooledtotal   <- pooledtotal + 1
  if (pvalueunpooled[i] > .05) unpooledtotal <- unpooledtotal + 1
}
###############################################################################
#              Program Conclusion: fraction of p-values > .05                 #
###############################################################################
pooledtotal/10000
unpooledtotal/10000
###############################################################################
Problem 3
Data Import
setwd("R:/Fridline/Statistical Data Management/Project #3")
getwd()
bootdata <- read.table("Proj3_prob3.txt", header=T, sep=" ")
PART A: Estimate for ρ
The estimate of the correlation coefficient (ρ̂ = .29932) is displayed in the output below.
x1bar = mean(bootdata[,1])
x2bar = mean(bootdata[,2])
sumofx = 0
sumofy = 0
sumofx2 = 0
sumofy2 = 0
sumofxy = 0
for (i in 1:400)
{
sumofx = sumofx + bootdata[i,1]
sumofy = sumofy + bootdata[i,2]
sumofx2 = sumofx2 + (bootdata[i,1]**2)
sumofy2 = sumofy2 + (bootdata[i,2]**2)
sumofxy = sumofxy + bootdata[i,1]*bootdata[i,2]
}
r_numerator = (400*sumofxy) - (sumofx*sumofy)
r_denominator1 = (400*sumofx2) - ((sumofx)**2)
r_denominator2 = (400*sumofy2) - ((sumofy)**2)
r_denominator = sqrt(r_denominator1*r_denominator2)
r = r_numerator/r_denominator
originalr = r   # saved for later parts (the bootstrap loop below reuses the name r)
originalr
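As a cross-check (not part of the original solution), R's built-in cor() implements the same product-moment computational formula; the vectors below are arbitrary illustrations rather than a re-read of bootdata:

```r
# cor() should agree with the hand-computed product-moment formula
x <- c(1.2, 3.4, 2.2, 5.0, 4.1)
y <- c(0.9, 2.8, 2.5, 4.6, 4.4)
n <- length(x)
r_byhand <- (n*sum(x*y) - sum(x)*sum(y)) /
  sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
all.equal(r_byhand, cor(x, y))
```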
PART B, C, D: The Bootstrap Distribution
The histogram displays the bootstrap distribution. Note that it appears approximately normal and
symmetric. The output below displays the mean (.29854), bias (-.00078), and
standard error (.05043) of the bootstrap distribution.
sim = 1000
simulatedvalues <- numeric(sim)
for(i in 1:sim)
{
n <- 400
index <- sample(1:400, 400, replace=TRUE)
X1 <- bootdata[index,1]
Y1 <- bootdata[index,2]
Xsq <- X1^2
Ysq <- Y1^2
XY <- X1*Y1
Xcume <- sum(X1)
Ycume <- sum(Y1)
Xsqcume <- sum(Xsq)
Ysqcume <- sum(Ysq)
XYcume <- sum(XY)
r <- (XYcume - (((Xcume)*(Ycume))/n))/(sqrt(Xsqcume - (Xcume^2/n))*sqrt(Ysqcume - (Ycume^2/n)))
simulatedvalues[i] <- r
}
mean(simulatedvalues)
bias = mean(simulatedvalues) - .2993151
bias
sd(simulatedvalues)
hist(simulatedvalues)
PART E: 95% Bootstrap t Confidence Interval
The 95% Bootstrap t confidence interval is (.20017, .39846). Due to the low bias and the
approximate normality of the bootstrap distribution, this is a good estimate of the 95%
confidence interval of the sampling distribution.
# 95% confidence interval
UB = originalr + (qt(.975,399)*sd(simulatedvalues))
LB = originalr - (qt(.975,399)*sd(simulatedvalues))
UB
LB
PART F: 95% Bootstrap Percentile Confidence Interval
The 95% Bootstrap percentile confidence interval is (.19275, .38115). Notice this is fairly close
to the 95% Bootstrap t confidence interval.
# 95% percentile interval
sortedvalues = sort(simulatedvalues)
LBpercentile = sortedvalues[25]    # .025 * 1000
UBpercentile = sortedvalues[975]   # .975 * 1000
LBpercentile
UBpercentile
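The same interval can be read off with quantile(); with type=1 (the inverse empirical CDF) the .025 and .975 quantiles of 1000 sorted replicates are exactly the 25th and 975th order statistics. The vector below is a stand-in for simulatedvalues:

```r
# Percentile interval via quantile() on a stand-in bootstrap vector
set.seed(7)
boot_reps <- sort(rnorm(1000, mean = 0.3, sd = 0.05))
ci_index    <- c(boot_reps[25], boot_reps[975])
ci_quantile <- quantile(boot_reps, c(.025, .975), type = 1)
ci_index
ci_quantile
```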
PART G, H, I: Jackknife
x1 = bootdata[,1]
y1 = bootdata[,2]
n = 400
jack <- rep(0,n)
for(j in 1:n) {
sampx <- x1[-j]
sampy <- y1[-j]
jack[j] = cor(sampx,sampy)
}
mean(jack)
originalr
meanjack = mean(jack)
biasjack = 399*(meanjack - originalr)
biasjack #g
biasedcorrect = originalr - biasjack
biasedcorrect #h
se = sqrt(((399)/400)*sum((jack-meanjack)**2))
LB = biasedcorrect - (qt(.975,399)*se)
UB = biasedcorrect + (qt(.975,399)*se)
LB #i
UB #i
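The jackknife recipe above can be restated compactly, with sapply replacing the explicit loop; the data here are made up for illustration, not the Proj3_prob3.txt values:

```r
# Compact jackknife bias / bias-corrected estimate / SE for a correlation
set.seed(9)
x <- rnorm(30)
y <- x + rnorm(30)
n <- length(x)
r_hat <- cor(x, y)
jack  <- sapply(1:n, function(j) cor(x[-j], y[-j]))  # leave-one-out estimates
bias_jack   <- (n - 1) * (mean(jack) - r_hat)
r_corrected <- r_hat - bias_jack
se_jack     <- sqrt(((n - 1)/n) * sum((jack - mean(jack))^2))
c(bias = bias_jack, corrected = r_corrected, se = se_jack)
```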
PART J: Confidence Interval Discussion
The three confidence intervals were very close to each other, the distributions are approximately normal,
and the bias is small, so the confidence intervals are fairly accurate.
Problem 4
Scatterplot
data = c( 1, 1.26, -.055,
          2, 1.85, -.040,
          3, 1.10, -.026,
          4, 2.50, -.017,
          5, 2.17, -.017,
          6, 2.67,  .017,
          7, 2.01,  .021,
          8, 2.18,  .025,
          9, 2.58,  .027,
         10, 2.75,  .033,
         11, 2.75,  .064,
         12, 3.33,  .077,
         13, 3.65,  .124)
dataset = t(matrix(data, ncol=13))
plot(dataset[,2], dataset[,3], xlab="Social Distress", ylab="Brain Activity")
cor(dataset[,2], dataset[,3])
Permutation Distribution
Since the p-value is 0, at any significance level there is enough
evidence to conclude the correlation is larger than 0.
data = c( 1, 1.26, -.055,
          2, 1.85, -.040,
          3, 1.10, -.026,
          4, 2.50, -.017,
          5, 2.17, -.017,
          6, 2.67,  .017,
          7, 2.01,  .021,
          8, 2.18,  .025,
          9, 2.58,  .027,
         10, 2.75,  .033,
         11, 2.75,  .064,
         12, 3.33,  .077,
         13, 3.65,  .124)
dataset = t(matrix(data, ncol=13))
teststat = cor(dataset[,2], dataset[,3])
corvals = numeric(1000)
for(i in 1:1000)
{
index <- sample(1:13, 13, replace=FALSE)
newactivity <- dataset[index,3]
corvals[i] = cor(dataset[,2], newactivity)
}
hist(corvals)
mean(corvals)
sd(corvals)
pvalue = 0
for(i in 1:1000)
{
pvalue = pvalue + (teststat < corvals[i])
}
pvalue = pvalue/1000
pvalue
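The counting loop can be collapsed into a single vectorized line: for a one-sided test of r > 0, the permutation p-value is the fraction of permuted correlations at least as large as the observed one. The corvals below are a stand-in for the real permutation distribution:

```r
# Vectorized equivalent of the p-value counting loop
set.seed(3)
teststat <- 0.878
corvals  <- rnorm(1000, 0, 0.3)  # stand-in for the permuted correlations
pvalue   <- mean(corvals >= teststat)
pvalue
```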