INFERENTIAL STATISTICS IN EXCEL

ONE VARIABLE
I) One Numeric Variable (not comparing populations)
A) One sample z or t test for Ho: μ (≤, =, ≥) μo
1. Create column containing sample data
2. Check normality assumption with normal quantile plot if small sample size (see “Descriptive Statistics in Excel”)
3. Calculate test statistic
z = (xbar-μo)/(σ/sqrt(n)) if σ is known (z test)
t = (xbar-μo)/(s/sqrt(n)) if σ is not known (t test)
4. Calculate p value
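a. z tests
“= 2*(1-normsdist(abs(z)))” for Ho: μ = μo
(“normsdist” gives lower tail probabilities)
“= 1-normsdist(z)” for Ho: μ ≤ μo
“= normsdist(z)” for Ho: μ ≥ μo
b. t tests (df = n-1)
“= tdist(abs(t), df, 2)” for Ho: μ = μo
(“tdist” gives upper tail probabilities and requires t ≥ 0)
“= tdist(t, df, 1)” for Ho: μ ≤ μo
“= 1-tdist(t, df, 1)” for Ho: μ ≥ μo
If t is negative, tdist(-t, df, 1) is the lower tail probability: use it as the p value for Ho: μ ≥ μo, and use 1-tdist(-t, df, 1) for Ho: μ ≤ μo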
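The z test p values above can be cross-checked outside Excel. The following is a minimal Python sketch, not part of the handout: the function names `z_statistic` and `z_test_p_value` are illustrative, and `NormalDist().cdf(z)` from the standard library plays the role of normsdist(z).

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(xbar, mu0, sigma, n):
    """z = (xbar - mu0) / (sigma / sqrt(n)), for a known population sigma."""
    return (xbar - mu0) / (sigma / sqrt(n))


def z_test_p_value(z, null):
    """p value for Ho: mu (null) mu0, where null is '=', '<=' or '>='."""
    lower = NormalDist().cdf(z)  # lower tail probability, like normsdist(z)
    if null == "=":
        # two-sided: double the tail area beyond |z|
        return 2 * (1 - NormalDist().cdf(abs(z)))
    if null == "<=":
        return 1 - lower  # upper tail
    if null == ">=":
        return lower  # lower tail
    raise ValueError("null must be '=', '<=' or '>='")


# Hypothetical numbers, chosen only to illustrate the call
z = z_statistic(xbar=110, mu0=100, sigma=20, n=16)
print(z, z_test_p_value(z, "="))
```

The two-sided case uses abs(z) so that a negative z does not produce a p value above 1.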
EX: ONE SAMPLE T TEST
Ho: μ ≥ 100 (use σ in place of s for z test)
      A           B          C
 1    velocity
 2    170
 3    290         xbar       =average(A2:A14)
 4    -130        s          =stdev(A2:A14)
 5    -70         n          =count(A2:A14)
 6    -185        t          =(C3-100)/(C4/sqrt(C5))
 7    -220        P value    =1-tdist(C6, C5-1, 1)
 8    200
 9    290
10    270
11    200
12    300
13    -30
14    650

Resulting values:
      A           B          C
 1    velocity
 2    170
 3    290         xbar       133.4615385
 4    -130        s          247.7009674
 5    -70         n          13
 6    -185        t          0.487068315
 7    -220        P value    0.68250722
 8    200
 9    290
10    270
11    200
12    300
13    -30
14    650
[Figure: Normal Quantile Plot of Velocity]
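The worked example can be reproduced outside Excel. The sketch below is a Python illustration under stated assumptions: the standard library has no t distribution, so `t_cdf` is a hypothetical helper that integrates the t density numerically with Simpson's rule; it is not how Excel computes tdist.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t), by Simpson's rule on [0, |t|] plus symmetry (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


velocity = [170, 290, -130, -70, -185, -220, 200, 290, 270, 200, 300, -30, 650]
mu0 = 100

n = len(velocity)
xbar = statistics.mean(velocity)
s = statistics.stdev(velocity)          # sample standard deviation, like stdev()
t = (xbar - mu0) / (s / math.sqrt(n))
p_value = t_cdf(t, n - 1)               # lower tail, for Ho: mu >= 100

print(xbar, s, n, t, p_value)
```

The printed values should agree with the spreadsheet results above (xbar about 133.46, t about 0.487, p value about 0.68) up to rounding.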
II) One Categorical Variable (not comparing populations)
A) Chi-square goodness of fit test
1. Create column containing sample data
2. Create a modified frequency table of results (see “Descriptive Statistics in Excel” handout)
a. Title second column “observed frequency”
b. Do not include relative frequency column
3. In the third column, enter the expected frequency under Ho
a. Expected frequency = hypothesized percentage*sample size
4. Check large sample assumption by making sure no expected frequency is ≤ 10
5. In the fourth column, calculate (observed frequency – expected frequency)^2 / expected frequency
6. Calculate test statistic, X^2 = SUM[(observed frequency – expected frequency)^2 / expected frequency]
7. Calculate p value, “=chidist(cell containing X^2, df)”, where df = number of categories – 1
EX: CHI SQUARE GOODNESS OF FIT TEST
Ho: P(round) = .75, P(wrinkled) = .25
      A            B                     C                     D
 1    Seed Form    Observed Frequency    Expected Frequency    (O-E)^2/E
 2    Round        336                   =.75*sum(B2:B3)       =(B2-C2)^2/C2
 3    Wrinkled     101                   =.25*sum(B2:B3)       =(B3-C3)^2/C3
 4
 5                                       chisquare             =sum(D2:D3)
 6                                       pvalue                =chidist(D5,count(B2:B3)-1)
Resulting values:
      A            B                     C                     D
 1    Seed Form    Observed Frequency    Expected Frequency    (O-E)^2/E
 2    Round        336                   327.75                0.207666
 3    Wrinkled     101                   109.25                0.622998
 4
 5                                       chisquare             0.830664
 6                                       pvalue                0.362081
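The goodness of fit calculation can also be sketched in Python. This is an illustration rather than the handout's method: `chi2_sf` is a hypothetical helper that returns the upper tail chi-square probability (the quantity chidist reports) for whole-number df, built up from the closed forms for df = 1 and df = 2.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X^2 >= x) for integer df >= 1 and x >= 0."""
    half = x / 2
    if df % 2:                       # odd df: start from the df = 1 closed form
        q, nu = math.erfc(math.sqrt(half)), 1
    else:                            # even df: start from the df = 2 closed form
        q, nu = math.exp(-half), 2
    while nu < df:                   # step df up by 2 until the target is reached
        q += half ** (nu / 2) * math.exp(-half) / math.gamma(nu / 2 + 1)
        nu += 2
    return q


observed = {"Round": 336, "Wrinkled": 101}
hypothesized = {"Round": 0.75, "Wrinkled": 0.25}   # proportions under Ho

n = sum(observed.values())
expected = {k: hypothesized[k] * n for k in observed}
terms = {k: (observed[k] - expected[k]) ** 2 / expected[k] for k in observed}
chisquare = sum(terms.values())
df = len(observed) - 1
pvalue = chi2_sf(chisquare, df)

print(expected, terms, chisquare, pvalue)
```

The output should match the spreadsheet values above: expected frequencies 327.75 and 109.25, chisquare about 0.8307, and pvalue about 0.3621.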
TWO VARIABLES
II) Categorical vs. Numeric (comparing population means)
A) Unstacking data (grouping numeric observations by values of categorical variable)
1. Start with 2 columns of data, one for categorical variable, one for numeric variable
2. StatPlus → Manipulate Columns → Unstack Columns
Click Data Values
✓ Use Range References: Highlight numeric column
✓ Range includes a row of column labels if variable name highlighted with data
Click Categories
✓ Use Range References: Highlight categorical column
✓ Range includes a row of column labels if variable name highlighted with data
Click Output
✓ Cell: Highlight cell you want as upper left corner of output
✓ Dynamic if you want output to update itself upon any future changes
or ✓ Static if you do not want output to update itself upon any future changes
✓ Sort the Columns if you would like the columns arranged in alphabetical order
EX: StatPlus → Manipulate Columns → Unstack Columns
Data Values, ✓ Use Range References Sheet1!$B$1:$B$40, ✓ Range includes a row of column labels
Categories, ✓ Use Range References Sheet1!$A$1:$A$40, ✓ Range includes a row of column labels
Output, ✓ Cell Sheet1!$D$1, ✓ Dynamic
      A          B         D          E       F
 1    Pitch      Height    Soprano    Alto    Bass
 2    Soprano    64        64         65      72
 3    Soprano    62        62         62      70
 4    Soprano    66        66         68      72
 5    Soprano    65        65         67      69
 6    Soprano    60        60         67      73
 7    Soprano    61        61         63      71
 8    Soprano    65        65         67      72
 9    Soprano    66        66         66      68
10    Soprano    65        65         63      68
11    Soprano    63        63         72      71
12    Soprano    67        67         62      66
13    Soprano    65        65                 68
14    Soprano    62        62                 71
15    Soprano    65        65                 73
16    Alto       65
17    Alto       62
18    Alto       68
19    Alto       67
20    Alto       67
21    Alto       63
22    Alto       67
23    Alto       66
24    Alto       63
25    Alto       72
26    Alto       62
27    Bass       72
28    Bass       70
29    Bass       72
30    Bass       69
31    Bass       73
32    Bass       71
33    Bass       72
34    Bass       68
35    Bass       68
36    Bass       71
37    Bass       66
38    Bass       68
39    Bass       71
40    Bass       73
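Unstacking is simply grouping the numeric values by category. The Python sketch below shows the idea on the first two rows of each voice part from the table above; the function name `unstack` is illustrative and has nothing to do with how StatPlus is implemented.

```python
from collections import defaultdict


def unstack(categories, values):
    """Group values into one list per category, keeping first-seen order."""
    columns = defaultdict(list)
    for category, value in zip(categories, values):
        columns[category].append(value)
    return dict(columns)


# A small subset of the Pitch / Height columns, for illustration only
pitch = ["Soprano", "Soprano", "Alto", "Alto", "Bass", "Bass"]
height = [64, 62, 65, 62, 72, 70]

unstacked = unstack(pitch, height)
print(unstacked)
```

Each key of the result corresponds to one output column (D, E, F in the example), and the lists may have different lengths, just as the unstacked columns do.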
B) t tests (comparing 2 population means, independent samples)
1. Unstack data into 2 columns, one for each value of your categorical variable
2. Check normality assumption with normal quantile plots if small sample sizes
3. Check equal variance assumption with F Test for equal variances
a. Tools → Data Analysis → F-test Two-Sample for Variances
Variable 1 Range: Highlight first column
Variable 2 Range: Highlight second column
✓ Labels if headings included
Output Range: Click on cell that you want as the upper left corner of table
b. Find test statistic, F
c. Find p value, P(F<=f) one-tail (this is a one-sided p value; double it for the two-sided alternative that the variances are not equal)
EX: NORMAL QUANTILE PLOTS AND TEST FOR EQUAL VARIANCES
Ho: Variance of soprano heights = Variance of alto heights
Variable 1 Input Range $A$1:$A$15, Variable 2 Input Range $B$1:$B$12
✓ Labels in first row, Output Range $D$2
[Figure: Normal Quantile Plot of Soprano Heights]
[Figure: Normal Quantile Plot of Alto Heights]
      A          B       D                                  E          F
 1    Soprano    Alto
 2    64         65      F-test Two-Sample for Variances
 3    62         62
 4    66         68                                         Soprano    Alto
 5    65         67      Mean                               64         65.6363
 6    60         67      Variance                           4.3076     9.2545
 7    61         63      Observations                       14         11
 8    65         67      df                                 13         10
 9    66         66      F                                  0.4654
10    65         63      P(F<=f) one-tail                   0.0985
11    63         72      F Critical one-tail                0.3743
12    67         62
13    65
14    62
15    65
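The F statistic in this output is the ratio of the two sample variances, first column over second. A short Python sketch of that arithmetic follows; it stops at the statistic because the standard library has no F distribution, so the p value is still read from Excel.

```python
import statistics

soprano = [64, 62, 66, 65, 60, 61, 65, 66, 65, 63, 67, 65, 62, 65]
alto = [65, 62, 68, 67, 67, 63, 67, 66, 63, 72, 62]

var_soprano = statistics.variance(soprano)   # sample variance (n - 1 divisor)
var_alto = statistics.variance(alto)

f_stat = var_soprano / var_alto              # Variable 1 variance / Variable 2 variance
df1, df2 = len(soprano) - 1, len(alto) - 1   # numerator and denominator df

print(var_soprano, var_alto, f_stat, df1, df2)
```

The variances (about 4.308 and 9.255), F (about 0.465) and df (13 and 10) should line up with the Excel table above.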
4) Two sample t test for equal variances (pooled t test) -- If fail to reject Ho: Variances are equal
a. Tools → Data Analysis → t-test: Two Sample Assuming Equal Variances
Variable 1 Range: Highlight first column
Variable 2 Range: Highlight second column
✓ Labels if headings included
Output Range: Click on cell that you want as the upper left corner of table
b. Find test statistic, t Stat
c. Find p value
“P(T<=t) two-tail” for Ho: μ1 - μ2 = 0
For one-sided tests, “P(T<=t) one-tail” is the tail area beyond t Stat on the side of its sign, so the p value depends on the sign of t Stat:
Ho: μ1 - μ2 ≤ 0: “P(T<=t) one-tail” if t Stat > 0, 1-“P(T<=t) one-tail” if t Stat < 0
Ho: μ1 - μ2 ≥ 0: “P(T<=t) one-tail” if t Stat < 0, 1-“P(T<=t) one-tail” if t Stat > 0
EX: TWO SAMPLE T TEST, ASSUMING EQUAL VARIANCES
Variable 1 Range $A$1:$A$15, Variable 2 Range $B$1:$B$12, ✓ Labels in first row, Output Range $D$2
      A          B       D                                              E          F
 1    Soprano    Alto
 2    64         65      t-test: Two Sample Assuming Equal Variances
 3    62         62
 4    66         68                                                     Soprano    Alto
 5    65         67      Mean                                           64         65.6363
 6    60         67      Variance                                       4.3076     9.2545
 7    61         63      Observations                                   14         11
 8    65         67      Pooled Variance                                6.4584
 9    66         66      Hypothesized Mean Difference                   0
10    65         63      df                                             23
11    63         72      t Stat                                         -1.5981
12    67         62      P(T<=t) one-tail                               0.0618
13    65                 t Critical one-tail                            1.7138
14    62                 P(T<=t) two-tail                               0.1236
15    65                 t Critical two-tail                            2.0686
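The pooled t test output can be checked by hand. The Python sketch below is an illustration under one stated assumption: `t_cdf` is a hypothetical helper that integrates the t density numerically (Simpson's rule), since the standard library has no t distribution.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t), by Simpson's rule on [0, |t|] plus symmetry (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area


soprano = [64, 62, 66, 65, 60, 61, 65, 66, 65, 63, 67, 65, 62, 65]
alto = [65, 62, 68, 67, 67, 63, 67, 66, 63, 72, 62]

n1, n2 = len(soprano), len(alto)
df = n1 + n2 - 2
# Pooled variance: the two sample variances weighted by their df
pooled_var = ((n1 - 1) * statistics.variance(soprano)
              + (n2 - 1) * statistics.variance(alto)) / df
t_stat = (statistics.mean(soprano) - statistics.mean(alto)) / math.sqrt(
    pooled_var * (1 / n1 + 1 / n2))
p_two_tail = 2 * (1 - t_cdf(abs(t_stat), df))
p_lower_tail = t_cdf(t_stat, df)   # p value for Ho: mu1 - mu2 >= 0

print(pooled_var, df, t_stat, p_lower_tail, p_two_tail)
```

Because t Stat is negative here, the lower tail probability equals the "P(T<=t) one-tail" value in the Excel output.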
5) Two sample t test for unequal variances -- If reject Ho: Variances are equal
a. Tools → Data Analysis → t-test: Two Sample Assuming Unequal Variances
Variable 1 Range: Highlight first column
Variable 2 Range: Highlight second column
✓ Labels if headings included
Output Range: Click on cell that you want as the upper left corner of table
b. Find test statistic, t Stat
c. Find p value
“P(T<=t) two-tail” for Ho: μ1 - μ2 = 0
For one-sided tests, “P(T<=t) one-tail” is the tail area beyond t Stat on the side of its sign, so the p value depends on the sign of t Stat:
Ho: μ1 - μ2 ≤ 0: “P(T<=t) one-tail” if t Stat > 0, 1-“P(T<=t) one-tail” if t Stat < 0
Ho: μ1 - μ2 ≥ 0: “P(T<=t) one-tail” if t Stat < 0, 1-“P(T<=t) one-tail” if t Stat > 0
NOTE: μ1 corresponds to the first column, μ2 corresponds to the second column
EX: T TEST, ASSUMING UNEQUAL VARIANCES
Variable 1 Range $A$1:$A$15, Variable 2 Range $B$1:$B$12, ✓ Labels in first row, Output Range $D$2
      A          B       D                                                E          F
 1    Soprano    Alto
 2    64         65      t-test: Two Sample Assuming Unequal Variances
 3    62         62
 4    66         68                                                       Soprano    Alto
 5    65         67      Mean                                             64         65.6363
 6    60         67      Variance                                         4.3076     9.2545
 7    61         63      Observations                                     14         11
 8    65         67      Hypothesized Mean Difference                     0
 9    66         66      df                                               17
10    65         63      t Stat                                           -1.5265
11    63         72      P(T<=t) one-tail                                 0.0726
12    67         62      t Critical one-tail                              1.7396
13    65                 P(T<=t) two-tail                                 0.1452
14    62                 t Critical two-tail                              2.1098
15    65
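For the unequal variances test, the t statistic uses each sample's own variance, and the df comes from the Welch-Satterthwaite approximation. The Python sketch below shows that arithmetic; rounding the df to the nearest whole number mirrors the df of 17 in the output above.

```python
import math
import statistics

soprano = [64, 62, 66, 65, 60, 61, 65, 66, 65, 63, 67, 65, 62, 65]
alto = [65, 62, 68, 67, 67, 63, 67, 66, 63, 72, 62]

n1, n2 = len(soprano), len(alto)
a = statistics.variance(soprano) / n1   # s1^2 / n1
b = statistics.variance(alto) / n2      # s2^2 / n2

# t statistic with separate (unpooled) variances
t_stat = (statistics.mean(soprano) - statistics.mean(alto)) / math.sqrt(a + b)

# Welch-Satterthwaite approximate degrees of freedom
df_exact = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
df_rounded = round(df_exact)

print(t_stat, df_exact, df_rounded)
```

Compared with the pooled test, t Stat moves from about -1.598 to about -1.527 and the df drops from 23 to 17, which is why the p values in the two outputs differ.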
B) Paired t test (paired samples)
1. Create 2 columns of sample data, one column for each value of your categorical variable, one row per matching pair
2. Create “differences” column, containing (column 1 – column 2)
3. Check normality assumption with normal quantile plot of differences if small number of pairs
4. Tools → Data Analysis → t-test: Paired Two Sample for Means
Variable 1 Range: Highlight first column
Variable 2 Range: Highlight second column
Hypothesized Mean Difference = 0
✓ Labels if headings included
Output Range: Click on cell that you want as the upper left corner of table
5. Find test statistic, t Stat
6. Find p value
“P(T<=t) two-tail” for Ho: μd = 0
For one-sided tests, “P(T<=t) one-tail” is the tail area beyond t Stat on the side of its sign, so the p value depends on the sign of t Stat:
Ho: μd ≤ 0: “P(T<=t) one-tail” if t Stat > 0, 1-“P(T<=t) one-tail” if t Stat < 0
Ho: μd ≥ 0: “P(T<=t) one-tail” if t Stat < 0, 1-“P(T<=t) one-tail” if t Stat > 0
NOTE: μd refers to mean difference where the second column is subtracted from the first
EX: PAIRED T TEST
Variable 1 Range $A$1:$A$12, Variable 2 Range $B$1:$B$12,
Hypothesized Mean Difference = 0, ✓ Labels in first row, Output Range $D$2
[Figure: Normal Quantile Plot of Differences]
      A         B        C             D                                      E         F
 1    Before    After    Difference
 2    8         5        3             t-test: Paired Two Sample for Means
 3    7         7        0
 4    6         4        2                                                    Before    After
 5    12        6        6             Mean                                   7.5454    5.8181
 6    5         7        -2            Variance                               7.4727    3.7636
 7    4         2        2             Observations                           11        11
 8    10        7        3             Pearson Correlation                    0.4731
 9    9         9        0             Hypothesized Mean Difference           0
10    4         6        -2            df                                     10
11    7         4        3             t Stat                                 2.2973
12    11        7        4             P(T<=t) one-tail                       0.0222
13                                     t Critical one-tail                    1.8124
14                                     P(T<=t) two-tail                       0.0444
15                                     t Critical two-tail                    2.2281
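A paired t test is a one sample t test on the differences. The Python sketch below recomputes the differences column and t Stat from the Before and After data; it is an illustration of the arithmetic only, and the p value is left to the Excel output.

```python
import math
import statistics

before = [8, 7, 6, 12, 5, 4, 10, 9, 4, 7, 11]
after = [5, 7, 4, 6, 7, 2, 7, 9, 6, 4, 7]

# Differences are (column 1 - column 2), one per matched pair
differences = [b - a for b, a in zip(before, after)]

n = len(differences)
mean_d = statistics.mean(differences)
sd_d = statistics.stdev(differences)
t_stat = (mean_d - 0) / (sd_d / math.sqrt(n))   # hypothesized mean difference = 0
df = n - 1

print(differences, mean_d, t_stat, df)
```

The differences should match column C, and t Stat (about 2.297) and df (10) should match the output table.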
C) One Way ANOVA (comparing k population means, independent samples)
1. Create Input Range: k columns of data, one for each value of your categorical variable
2. Check normality assumption with normal quantile plots if small sample sizes
3. Test for equal variances; note that Tools → Data Analysis → F-test Two-Sample for Variances compares only two columns at a time
4. Tools → Data Analysis → Anova: Single Factor
Input Range: Highlight all k columns at once
✓ Labels in First Row if headings included
Output Range: Click on cell that you want as the upper left corner of table
5. Find test statistic, F
6. Find p value, P-value
EX: SINGLE FACTOR ANOVA, k = 3
Ho: μ1 = μ2 = μ3
Input Range $A$1:$C$15,
✓ Labels in first row, Output Range $D$2
[Figure: Normal Quantile Plot of Soprano Heights]
[Figure: Normal Quantile Plot of Alto Heights]
[Figure: Normal Quantile Plot of Bass Heights]
      A          B       C
 1    Soprano    Alto    Bass
 2    64         65      72
 3    62         62      70
 4    66         68      72
 5    65         67      69
 6    60         67      73
 7    61         63      71
 8    65         67      72
 9    66         66      68
10    65         63      68
11    63         72      71
12    67         62      66
13    65                 68
14    62                 71
15    65                 73

Output (upper left corner at D2):
Anova: Single Factor

SUMMARY
Groups     Count    Sum    Average    Variance
Soprano    14       896    64         4.3076
Alto       11       722    65.6363    9.2545
Bass       14       984    70.2857    4.6813

ANOVA
Source of Variation    SS          df    MS          F          P-value     F crit
Between Groups         294.4948    2     147.2474    25.3144    1.37E-07    3.2594
Within Groups          209.4025    36    5.8167
Total                  503.8974    38
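The ANOVA table splits the total variation into a between groups part and a within groups part, and F is the ratio of their mean squares. The Python sketch below reproduces SS, df, MS and F for the three voice parts; it is an illustration of the arithmetic, and the p value is left to Excel.

```python
groups = {
    "Soprano": [64, 62, 66, 65, 60, 61, 65, 66, 65, 63, 67, 65, 62, 65],
    "Alto": [65, 62, 68, 67, 67, 63, 67, 66, 63, 72, 62],
    "Bass": [72, 70, 72, 69, 73, 71, 72, 68, 68, 71, 66, 68, 71, 73],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = sum(all_values) / len(all_values)
means = {name: sum(g) / len(g) for name, g in groups.items()}

# Between groups: group size times squared distance of group mean from grand mean
ss_between = sum(len(g) * (means[name] - grand_mean) ** 2
                 for name, g in groups.items())
# Within groups: squared distance of each value from its own group mean
ss_within = sum((v - means[name]) ** 2
                for name, g in groups.items() for v in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(ss_between, ss_within, df_between, df_within, f_stat)
```

The two sums of squares add up to the Total row (about 503.90), which is a quick consistency check on any ANOVA table.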
III) Categorical vs. Categorical (comparing population proportions, testing independence of categorical variables)
A) Chi-Square test of independence
1. Find template with appropriate r and c
a. r = Number of possible values for variable 1
b. c = Number of possible values for variable 2
2. Fill sample frequencies in the shaded cells
a. Row%, column%, totals, expected freq’s, and (O-E)^2/E are automatically calculated
3. Check large sample assumption by making sure no expected frequency is ≤ 10
4. Find test statistic, X2
5. Find p value, pvalue
EX: CHI SQUARE TEST OF INDEPENDENCE
r = 2, c = 3 → “2X3CHI2” template
Ho: Goals are independent of gender
                      Goal
Gender     Grades     Popular    Sports     total
Boy        117        50         60         227
  row %    0.5154     0.2203     0.2643
  col %    0.4737     0.3546     0.6667
Girl       130        91         30         251
  row %    0.5179     0.3625     0.1195
  col %    0.5263     0.6454     0.3333
total      247        141        90         478

Observed    Expected    (O-E)^2/E
117         117.2992    0.000763
50          66.9603     4.295834
60          42.7406     6.969661
130         129.7008    0.00069
91          74.0397     3.885077
30          47.2594     6.303239

X2        21.45526
df        2
pvalue    2.19E-05
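The template's calculations can be sketched in Python. Each expected frequency is (row total × column total) / grand total, and df = (r-1)(c-1). The helper `chi2_sf` is a hypothetical stand-in for chidist, valid for whole-number df; it is not taken from the template.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X^2 >= x) for integer df >= 1 and x >= 0."""
    half = x / 2
    if df % 2:                       # odd df: start from the df = 1 closed form
        q, nu = math.erfc(math.sqrt(half)), 1
    else:                            # even df: start from the df = 2 closed form
        q, nu = math.exp(-half), 2
    while nu < df:                   # step df up by 2 until the target is reached
        q += half ** (nu / 2) * math.exp(-half) / math.gamma(nu / 2 + 1)
        nu += 2
    return q


# Rows: Boy, Girl. Columns: Grades, Popular, Sports.
observed = [[117, 50, 60],
            [130, 91, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)
pvalue = chi2_sf(x2, df)

print(expected, x2, df, pvalue)
```

The expected frequencies, X2 (about 21.455), df (2) and pvalue (about 2.19E-05) should agree with the template output above.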
IV) Numeric vs. Numeric (comparing mean response across a range of explanatory values)
A) Simple Linear Regression
1. Create 2 columns of sample data, one column per variable, one row per observation
2. Tools → Data Analysis → Regression
Input Y Range: Highlight second column (response variable if applicable)
Input X Range: Highlight first column (explanatory variable if applicable)
✓ Labels if headings included
Output Range: Click on cell that you want as the upper left corner of table
✓ Residual plots
✓ Line fit plots
✓ Standardized residuals
✓ Normal probability plots
3. Check Normality/no outlier assumption
a. Normal Quantile (probability) plot of residuals
b. Make sure no standardized residuals are less than –3 or more than 3
c. Make sure no patterns in residual plots
4. Check linear relationship assumption by examining Line fit plot
5. Find test statistic for Ho: Slope = 0 (no linear relationship), t Stat (in second row of third table)
6. Find p value of test statistic, P-value (in second row of third table)
7. Find estimated slope, Coefficients (in second row of third table)
EX: SIMPLE LINEAR REGRESSION
Ho: The mean velocity at which stellar objects move away from earth does not depend linearly on their distance from earth
Input Y Range $B$1:$B$14, Input X Range $A$1:$A$14, ✓ Labels, Output Range $A$16,
✓ Residual plots, ✓ Line fit plots, ✓ Standardized residuals, ✓ Normal probability plot
      A           B
 1    distance    velocity
 2    0.032       170
 3    0.034       290
 4    0.214       -130
 5    0.263       -70
 6    0.275       -185
 7    0.275       -220
 8    0.45        200
 9    0.5         290
10    0.5         270
11    0.63        200
12    0.8         300
13    0.9         -30
14    0.9         650

Output (upper left corner at A16):
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.4306
R Square             0.1854
Adjusted R Square    0.1114
Standard Error       233.4910
Observations         13

ANOVA
              df    SS             MS            F         Significance F
Regression    1     136570.384     136570.384    2.5050    0.1417
Residual      11    599698.8467    54518.0769
Total         12    736269.2308

             Coefficients    Standard Error    t Stat      P-value    Lower 95%     Upper 95%     Lower 95.0%    Upper 95.0%
Intercept    -25.6264        119.5696          -0.21432    0.8342     -288.79748    237.544639    -288.797477    237.5446
distance     358.2441        226.3451          1.582734    0.1417     -139.93832    856.426606    -139.938316    856.4266

RESIDUAL OUTPUT
Observation    Predicted velocity    Residuals      Standard Residuals
1              -14.162606            184.162606     0.823806
2              -13.446118            303.446118     1.357392
3              51.037827             -181.037827    -0.809829
4              68.591791             -138.59179     -0.619956
5              72.890720             -257.890720    -1.153611
6              72.890720             -292.890720    -1.310175
7              135.58344             64.4165539     0.288151
8              153.49565             136.504346     0.610619
9              153.49565             116.504346     0.521154
10             200.06739             -0.06739215    -0.000301
11             260.96889             39.0311032     0.174596
12             296.79331             -326.7933      -1.461831

PROBABILITY OUTPUT
Percentile    velocity
3.846153      -220
11.53846      -185
19.23076      -130
26.92307      -70
34.61538      -30
42.30769      170
50            200
57.69230      200
65.38461      270
73.07692      290
80.76923      290
88.46153      300

[Figure: Normal Probability Plot (velocity vs. Sample Percentile)]
[Figure: distance Line Fit Plot (velocity and Predicted velocity vs. distance)]
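The slope, intercept and t Stat in the regression output follow from the usual least squares formulas. The Python sketch below recomputes them from the distance and velocity columns; it is an illustration of the arithmetic rather than a description of Excel's internals, and the p value is still read from the output.

```python
import math

distance = [0.032, 0.034, 0.214, 0.263, 0.275, 0.275, 0.45,
            0.5, 0.5, 0.63, 0.8, 0.9, 0.9]
velocity = [170, 290, -130, -70, -185, -220, 200,
            290, 270, 200, 300, -30, 650]

n = len(distance)
xbar = sum(distance) / n
ybar = sum(velocity) / n

sxx = sum((x - xbar) ** 2 for x in distance)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(distance, velocity))

slope = sxy / sxx                      # least squares slope
intercept = ybar - slope * xbar        # line passes through (xbar, ybar)

predicted = [intercept + slope * x for x in distance]
residuals = [y - p for y, p in zip(velocity, predicted)]

mse = sum(r * r for r in residuals) / (n - 2)   # residual mean square, df = n - 2
se_slope = math.sqrt(mse / sxx)
t_stat = slope / se_slope                       # test statistic for Ho: Slope = 0

print(intercept, slope, se_slope, t_stat)
```

The intercept (about -25.63), slope (about 358.24), its standard error (about 226.35) and t Stat (about 1.583) should agree with the Coefficients table above.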