INFERENTIAL STATISTICS IN EXCEL

ONE VARIABLE

I) One Numeric Variable (not comparing populations)

   A) One sample z or t test for Ho: μ (≤, =, ≥) μo
      1. Create column containing sample data
      2. Check normality assumption with normal quantile plot if small sample size (see “Descriptive Statistics in Excel”)
      3. Calculate test statistic
            z = (xbar - μo)/(σ/sqrt(n))   if σ is known (z test)
            t = (xbar - μo)/(s/sqrt(n))   if σ is not known (t test, df = n - 1)
      4. Calculate p value
         a. z tests (“normsdist” gives lower tail probabilities)
               “=2*(1-normsdist(abs(z)))”   for Ho: μ = μo
               “=1-normsdist(z)”            for Ho: μ ≤ μo
               “=normsdist(z)”              for Ho: μ ≥ μo
         b. t tests (“tdist” gives upper tail probabilities and requires t ≥ 0;
            if t is negative, enter abs(t) and swap the two one-sided formulas)
               “=tdist(t, df, 2)”     for Ho: μ = μo
               “=tdist(t, df, 1)”     for Ho: μ ≤ μo
               “=1-tdist(t, df, 1)”   for Ho: μ ≥ μo

EX: ONE SAMPLE T TEST
Ho: μ ≥ 100 (use σ in place of s for z test)

        A           B          C (formula entered)         C (result)
   1    velocity
   2    170
   3    290         xbar       =average(A2:A14)            133.4615385
   4    -130        s          =stdev(A2:A14)              247.7009674
   5    -70         n          =count(A2:A14)              13
   6    -185        t          =(C3-100)/(C4/sqrt(C5))     0.487068315
   7    -220        P value    =1-tdist(C6, C5-1, 1)       0.68250722
   8    200
   9    290
  10    270
  11    200
  12    300
  13    -30
  14    650

[Figure: Normal Quantile Plot of Velocity]

II) One Categorical Variable (not comparing populations)

   A) Chi-square goodness of fit test
      1. Create column containing sample data
      2. Create a modified frequency table of results (see “Descriptive Statistics in Excel” handout)
         a. Title second column “observed frequency”
         b. Do not include relative frequency column
      3. In the third column, enter the expected frequency under Ho
         a. Expected frequency = hypothesized proportion * sample size
      4. Check large sample assumption by making sure no expected frequency is less than 10
      5. In the fourth column, calculate (observed frequency - expected frequency)^2 / expected frequency
      6. Calculate test statistic, X^2 = SUM[(observed frequency - expected frequency)^2 / expected frequency]
      7. Calculate p value, “=chidist(X^2, df)”, where df = number of categories - 1

EX: CHI SQUARE GOODNESS OF FIT TEST
Ho: P(round) = .75, P(wrinkled) = .25

Formulas entered:
        A           B                     C                     D
   1    Seed Form   Observed Frequency    Expected Frequency    (O-E)^2/E
   2    Round       336                   =.75*sum(B2:B3)       =(B2-C2)^2/C2
   3    Wrinkled    101                   =.25*sum(B2:B3)       =(B3-C3)^2/C3
   4
   5                                      chisquare             =sum(D2:D3)
   6                                      pvalue                =chidist(D5,count(B2:B3)-1)

Results:
        A           B                     C                     D
   1    Seed Form   Observed Frequency    Expected Frequency    (O-E)^2/E
   2    Round       336                   327.75                0.207666
   3    Wrinkled    101                   109.25                0.622998
   4
   5                                      chisquare             0.830664
   6                                      pvalue                0.362081

TWO VARIABLES

II) Categorical vs. Numeric (comparing population means)

   A) Unstacking data (grouping numeric observations by values of categorical variable)
      1. Start with 2 columns of data, one for categorical variable, one for numeric variable
      2. StatPlus → Manipulate Columns → Unstack Columns
            Click Data Values → Use Range References: Highlight numeric column
               Check “Range includes a row of column labels” if variable name highlighted with data
            Click Categories → Use Range References: Highlight categorical column
               Check “Range includes a row of column labels” if variable name highlighted with data
            Click Output → Cell: Highlight cell you want as upper left corner of output
               Choose Dynamic if you want output to update itself upon any future changes,
               or Static if you do not want output to update itself upon any future changes
            Check Sort the Columns if you would like the columns arranged in alphabetical order

EX: StatPlus → Manipulate Columns → Unstack Columns
Data Values, Use Range References Sheet1!$B$1:$B$40, Range includes a row of column labels
Categories, Use Range References Sheet1!$A$1:$A$40, Range includes a row of column labels
Output, Cell Sheet1!$D$1, Dynamic

        A          B         C    D          E       F
   1    Pitch      Height         Soprano    Alto    Bass
   2    Soprano    64             64         65      72
   3    Soprano    62             62         62      70
   4    Soprano    66             66         68      72
   5    Soprano    65             65         67      69
   6    Soprano    60             60         67      73
   7    Soprano    61             61         63      71
   8    Soprano    65             65         67      72
   9    Soprano    66             66         66      68
  10    Soprano    65             65         63      68
  11    Soprano    63             63         72      71
  12    Soprano    67             67         62      66
  13    Soprano    65             65                 68
  14    Soprano    62             62                 71
  15    Soprano    65             65                 73
  16    Alto       65
  17    Alto       62
  18    Alto       68
  19    Alto       67
  20    Alto       67
  21    Alto       63
  22    Alto       67
  23    Alto       66
  24    Alto       63
  25    Alto       72
  26    Alto       62
  27    Bass       72
  28    Bass       70
  29    Bass       72
  30    Bass       69
  31    Bass       73
  32    Bass       71
  33    Bass       72
  34    Bass       68
  35    Bass       68
  36    Bass       71
  37    Bass       66
  38    Bass       68
  39    Bass       71
  40    Bass       73

   B) t tests (comparing 2 population means, independent samples)
      1. Unstack data into 2 columns, one for each value of your categorical variable
      2. Check normality assumption with normal quantile plots if small sample sizes
      3. Check equal variance assumption with F test for equal variances
         a. Tools → Data Analysis → F-Test Two-Sample for Variances
               Variable 1 Range: Highlight first column
               Variable 2 Range: Highlight second column
               Check Labels if headings included
               Output Range: Click on cell that you want as the upper left corner of table
         b. Find test statistic, F
         c. Find p value, P(F<=f) one-tail

EX: NORMAL QUANTILE PLOTS AND TEST FOR EQUAL VARIANCES
Ho: Variance of soprano heights = Variance of alto heights
Variable 1 Input Range $A$1:$A$15, Variable 2 Input Range $B$1:$B$12, Labels in first row, Output Range $D$2
Worksheet: column A holds the 14 Soprano heights and column B the 11 Alto heights from the unstacking example, with labels in row 1.

[Figure: Normal Quantile Plot of Soprano Heights]
[Figure: Normal Quantile Plot of Alto Heights]

   F-Test Two-Sample for Variances
                              Soprano     Alto
   Mean                       64          65.6363
   Variance                   4.3076      9.2545
   Observations               14          11
   df                         13          10
   F                          0.4654
   P(F<=f) one-tail           0.0985
   F Critical one-tail        0.3743

      4. Two sample t test for equal variances (pooled t test) -- If fail to reject Ho: Variances are equal
         a. Tools → Data Analysis → t-Test: Two-Sample Assuming Equal Variances
               Variable 1 Range: Highlight first column
               Variable 2 Range: Highlight second column
               Check Labels if headings included
               Output Range: Click on cell that you want as the upper left corner of table
         b. Find test statistic, t Stat
         c. Find p value
               “P(T<=t) two-tail”         for Ho: μ1 - μ2 = 0
               1 - “P(T<=t) one-tail”     for Ho: μ1 - μ2 ≤ 0
               “P(T<=t) one-tail”         for Ho: μ1 - μ2 ≥ 0
            NOTE: Excel's one-tail value is the tail area beyond the observed t Stat. The two
            one-sided rules above apply when t Stat is negative, as in the example; swap them
            when t Stat is positive.

EX: TWO SAMPLE T TEST, ASSUMING EQUAL VARIANCES
Variable 1 Range $A$1:$A$15, Variable 2 Range $B$1:$B$12, Labels in first row, Output Range $D$2
Worksheet: same Soprano (column A) and Alto (column B) data as in the F test example.

   t-Test: Two-Sample Assuming Equal Variances
                                   Soprano     Alto
   Mean                            64          65.6363
   Variance                        4.3076      9.25454
   Observations                    14          11
   Pooled Variance                 6.4584
   Hypothesized Mean Difference    0
   df                              23
   t Stat                          -1.5981
   P(T<=t) one-tail                0.0618
   t Critical one-tail             1.7138
   P(T<=t) two-tail                0.1236
   t Critical two-tail             2.0686

      5. Two sample t test for unequal variances -- If reject Ho: Variances are equal
         a. Tools → Data Analysis → t-Test: Two-Sample Assuming Unequal Variances
               Variable 1 Range: Highlight first column
               Variable 2 Range: Highlight second column
               Check Labels if headings included
               Output Range: Click on cell that you want as the upper left corner of table
         b. Find test statistic, t Stat
         c. Find p value
               “P(T<=t) two-tail”         for Ho: μ1 - μ2 = 0
               1 - “P(T<=t) one-tail”     for Ho: μ1 - μ2 ≤ 0
               “P(T<=t) one-tail”         for Ho: μ1 - μ2 ≥ 0
            NOTE: μ1 corresponds to the first column, μ2 corresponds to the second column.
            As in step 4, the one-sided rules apply when t Stat is negative; swap them when
            t Stat is positive.

EX: T TEST, ASSUMING UNEQUAL VARIANCES
Variable 1 Range $A$1:$A$15, Variable 2 Range $B$1:$B$12, Labels in first row, Output Range $D$2
Worksheet: same Soprano (column A) and Alto (column B) data as in the F test example.

   t-Test: Two-Sample Assuming Unequal Variances
                                   Soprano     Alto
   Mean                            64          65.6363
   Variance                        4.3076      9.2545
   Observations                    14          11
   Hypothesized Mean Difference    0
   df                              17
   t Stat                          -1.5265
   P(T<=t) one-tail                0.0726
   t Critical one-tail             1.7396
   P(T<=t) two-tail                0.1452
   t Critical two-tail             2.1098

   C) Paired t test (paired samples)
      1. Create 2 columns of sample data, one column for each value of your categorical variable, one row per matching pair
      2. Create “differences” column, containing (column 1 - column 2)
      3. Check normality assumption with normal quantile plot of differences if small number of pairs
      4. Tools → Data Analysis → t-Test: Paired Two Sample for Means
            Variable 1 Range: Highlight first column
            Variable 2 Range: Highlight second column
            Hypothesized Mean Difference = 0
            Check Labels if headings included
            Output Range: Click on cell that you want as the upper left corner of table
      5. Find test statistic, t Stat
      6. Find p value
            “P(T<=t) two-tail”         for Ho: μd = 0
            “P(T<=t) one-tail”         for Ho: μd ≤ 0
            1 - “P(T<=t) one-tail”     for Ho: μd ≥ 0
         NOTE: μd refers to the mean difference where the second column is subtracted from the first.
         The one-sided rules above apply when t Stat is positive, as in the example; swap them
         when t Stat is negative.

EX: PAIRED T TEST
Variable 1 Range $A$1:$A$12, Variable 2 Range $B$1:$B$12, Hypothesized Mean Difference = 0, Labels in first row, Output Range $D$2

[Figure: Normal Quantile Plot of Differences]

        A         B        C
   1    Before    After    Difference
   2    8         5        3
   3    7         7        0
   4    6         4        2
   5    12        6        6
   6    5         7        -2
   7    4         2        2
   8    10        7        3
   9    9         9        0
  10    4         6        -2
  11    7         4        3
  12    11        7        4

   t-Test: Paired Two Sample for Means
                                   Before      After
   Mean                            7.5454      5.8181
   Variance                        7.4727      3.7636
   Observations                    11          11
   Pearson Correlation             0.4731
   Hypothesized Mean Difference    0
   df                              10
   t Stat                          2.2973
   P(T<=t) one-tail                0.0222
   t Critical one-tail             1.8124
   P(T<=t) two-tail                0.0444
   t Critical two-tail             2.2281

   D) One Way ANOVA (comparing k population means, independent samples)
      1. Create Input Range: k columns of data, one for each value of your categorical variable
      2. Check normality assumption with normal quantile plots if small sample sizes
      3. Test for equal variances, e.g. with Tools → Data Analysis → F-Test Two-Sample for Variances
         (that tool compares only two groups at a time)
      4. Tools → Data Analysis → Anova: Single Factor
            Input Range: Highlight all k columns at once
            Check Labels in First Row if headings included
            Output Range: Click on cell that you want as the upper left corner of table
      5. Find test statistic, F
      6. Find p value, P-value

EX: SINGLE FACTOR ANOVA, k = 3
Ho: μ1 = μ2 = μ3
Input Range $A$1:$C$15, Labels in first row, Output Range $D$2
Worksheet: columns A, B and C hold the Soprano, Alto and Bass heights from the unstacking example, with labels in row 1.

[Figure: Normal Quantile Plot of Soprano Heights]
[Figure: Normal Quantile Plot of Alto Heights]
[Figure: Normal Quantile Plot of Bass Heights]

   Anova: Single Factor

   SUMMARY
   Groups      Count    Sum    Average    Variance
   Soprano     14       896    64         4.3076
   Alto        11       722    65.6363    9.2545
   Bass        14       984    70.2857    4.6813

   ANOVA
   Source of Variation    SS          df    MS          F          P-value     F crit
   Between Groups         294.4948    2     147.2474    25.3144    1.37E-07    3.2594
   Within Groups          209.4025    36    5.8167
   Total                  503.8974    38

III) Categorical vs. Categorical (comparing population proportions, testing independence of categorical variables)

   A) Chi-Square test of independence
      1. Find template with appropriate r and c
         a. r = Number of possible values for variable 1
         b. c = Number of possible values for variable 2
      2. Fill sample frequencies in the shaded cells
         a. Row %, column %, totals, expected frequencies, and (O-E)^2/E are automatically calculated
      3. Check large sample assumption by making sure no expected frequency is less than 10
      4. Find test statistic, X2
      5. Find p value, pvalue

EX: CHI SQUARE TEST OF INDEPENDENCE
r = 2, c = 3, “2X3CHI2” template
Ho: Goals are independent of gender

                       Goal
   Gender      Grades     Popular    Sports     total
   Boy         117        50         60         227
     row %     0.5154     0.2203     0.2643
     col %     0.4737     0.3546     0.6667
   Girl        130        91         30         251
     row %     0.5179     0.3625     0.1195
     col %     0.5263     0.6454     0.3333
   total       247        141        90         478

   Observed    117         50          60
               130         91          30
   Expected    117.2992    66.9603     42.7406
               129.7008    74.0397     47.2594
   (O-E)^2/E   0.000763    4.295834    6.969661
               0.00069     3.885077    6.303239

   X2          21.45526
   df          2
   pvalue      2.19E-05

IV) Numeric vs. Numeric (comparing mean response across a range of explanatory values)

   A) Simple Linear Regression
      1. Create 2 columns of sample data, one column per variable, one row per observation
      2. Tools → Data Analysis → Regression
            Input Y Range: Highlight second column (response variable if applicable)
            Input X Range: Highlight first column (explanatory variable if applicable)
            Check Labels if headings included
            Output Range: Click on cell that you want as the upper left corner of table
            Check Residual Plots, Line Fit Plots, Standardized Residuals, Normal Probability Plots
      3. Check normality/no outlier assumption
         a. Normal quantile (probability) plot of residuals
         b. Make sure no standardized residuals are less than -3 or more than 3
         c. Make sure no patterns in residual plots
      4. Check linear relationship assumption by examining Line Fit Plot
      5. Find test statistic for Ho: Slope = 0 (no linear relationship), t Stat (in second row of third table)
      6. Find p value of test statistic, P-value (in second row of third table)
      7. Find estimated slope, Coefficients (in second row of third table)

EX: SIMPLE LINEAR REGRESSION
Ho: The mean velocity at which stellar objects move away from earth does not depend linearly on their distance from earth
Input Y Range $B$1:$B$14, Input X Range $A$1:$A$14, Labels, Output Range $A$16, Residual plots, Line fit plots, Standardized residuals, Normal probability plot

        A           B
   1    distance    velocity
   2    0.032       170
   3    0.034       290
   4    0.214       -130
   5    0.263       -70
   6    0.275       -185
   7    0.275       -220
   8    0.45        200
   9    0.5         290
  10    0.5         270
  11    0.63        200
  12    0.8         300
  13    0.9         -30
  14    0.9         650

[Figure: Normal Probability Plot (velocity vs. Sample Percentile)]
[Figure: distance Line Fit Plot (velocity and Predicted velocity vs. distance)]

   SUMMARY OUTPUT

   Regression Statistics
   Multiple R             0.4306
   R Square               0.1854
   Adjusted R Square      0.1114
   Standard Error         233.4910
   Observations           13

   ANOVA
                  df    SS             MS            F         Significance F
   Regression     1     136570.384     136570.384    2.5050    0.1417
   Residual       11    599698.8467    54518.0769
   Total          12    736269.2308

                  Coefficients    Standard Error    t Stat      P-value    Lower 95%     Upper 95%
   Intercept      -25.6264        119.5696          -0.21432    0.8342     -288.79748    237.544639
   distance       358.2441        226.3451          1.582734    0.1417     -139.93832    856.426606

   RESIDUAL OUTPUT (observations 1 to 12)
   Observation    Predicted velocity    Residuals       Standard Residuals
   1              -14.162606            184.162606      0.823806
   2              -13.446118            303.446118      1.357392
   3              51.037827             -181.037827     -0.809829
   4              68.591791             -138.59179      -0.619956
   5              72.890720             -257.890720     -1.153611
   6              72.890720             -292.890720     -1.310175
   7              135.58344             64.4165539      0.288151
   8              153.49565             136.504346      0.610619
   9              153.49565             116.504346      0.521154
   10             200.06739             -0.06739215     -0.000301
   11             260.96889             39.0311032      0.174596
   12             296.79331             -326.7933       -1.461831

   PROBABILITY OUTPUT (first 12 rows)
   Percentile     velocity
   3.846153       -220
   11.53846       -185
   19.23076       -130
   26.92307       -70
   34.61538       -30
   42.30769       170
   50             200
   57.69230       200
   65.38461       270
   73.07692       290
   80.76923       290
   88.46153       300
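
The slope test in the regression example above can be cross-checked outside Excel. The following is a minimal Python sketch, not part of the handout's Excel procedure: it recomputes the least-squares intercept, slope, standard error of the slope, and t Stat for Ho: Slope = 0 from the distance and velocity columns. The function name simple_linear_regression is hypothetical, chosen only for this illustration. It stops at the t statistic because the Python standard library has no t distribution, so the p value is left to Excel.

```python
from math import sqrt

# Data copied from the regression example (distance = X, velocity = Y).
distance = [0.032, 0.034, 0.214, 0.263, 0.275, 0.275, 0.45,
            0.5, 0.5, 0.63, 0.8, 0.9, 0.9]
velocity = [170, 290, -130, -70, -185, -220, 200,
            290, 270, 200, 300, -30, 650]


def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, slope_se, t_stat), where t_stat tests
    Ho: slope = 0 with n - 2 degrees of freedom.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual mean square (the "Residual MS" cell in Excel's ANOVA table).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    slope_se = sqrt(mse / sxx)
    t_stat = slope / slope_se
    return intercept, slope, slope_se, t_stat


intercept, slope, slope_se, t_stat = simple_linear_regression(distance, velocity)
print(f"intercept = {intercept:.4f}")
print(f"slope     = {slope:.4f}")
print(f"slope SE  = {slope_se:.4f}")
print(f"t Stat    = {t_stat:.4f}")
```

The printed values should agree with the "distance" row of Excel's coefficient table (Coefficients, Standard Error, t Stat) and with the Intercept coefficient, up to rounding.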