Two-sample tests

Q: Is there a significant difference between the population means of two groups?
• Are the data paired or not?
  – If paired, work with differences
• If not paired: two independent samples
  – Are the variances approximately equal?
  – If approximately equal variances, use two-sample t-test with pooled variance estimate
  – If not equal, use two-sample t-test with unequal variances

Elastic Band Stretching

heated    244  255  253  254  251  269  248  252  292
ambient   225  247  249  253  245  259  242  255  286
diff       19    8    4    1    6   10    6   -3    6

[Figure: Scatterplot of heated vs ambient (y-axis: heated, x-axis: ambient)]

Paired T for heated - ambient

             N     Mean    StDev  SE Mean
heated       9  257.556   14.604    4.868
ambient      9  251.222   16.285    5.428
Difference   9  6.33333  6.10328  2.03443

95% CI for mean difference: (1.64194, 11.02473)
T-Test of mean difference = 0 (vs not = 0): T-Value = 3.11  P-Value = 0.014

… is equivalent to:

One-Sample T: difference
Test of mu = 0 vs not = 0

Variable     N     Mean    StDev  SE Mean               95% CI     T      P
difference   9  6.33333  6.10328  2.03443  (1.64194, 11.02473)  3.11  0.014

Why is pairing useful?
• If measurements are on the same individual or unit, then pairing accounts for the individual variation which gives association between the measurements
  – Differencing removes the individual differences
  – Leads to a simple one-sample t-test
  – Two-sample t-test is wrong: groups not independent

Soil Contamination
• Measure of growth inhibitor, phthalide, on reclaimed (11 samples) and farm land (9 samples)
• Is there any evidence of a difference in phthalide levels between reclaimed and farm land?
  – Null Hypothesis: no difference between population levels
  – Alternative Hypothesis: ?

1. Step: Check if variances are equal

[Figure: Boxplot of farm, reclaimed]

H0: Variances in both groups are equal
H1: Variances are not equal

F-Test (normal distribution): Test statistic = 0.83, p-value = 0.802
Levene's Test (any continuous distribution): Test statistic = 0.07, p-value = 0.788

H0 is not rejected => One can apply a two-sample t-test with equal variances

2. Step: Two-sample t-test with equal variances

            N   Mean  StDev  SE Mean
farm        9  38.33   2.92     0.97
reclaimed  11  35.09   3.21     0.97

Difference = mu (farm) - mu (reclaimed)
Estimate for difference: 3.24242
95% CI for difference: (0.33269, 6.15216)
T-Test of difference = 0 (vs not =): T-Value = 2.34  P-Value = 0.031  DF = 18
Both use Pooled StDev = 3.0814

=> There is a significant difference in phthalide levels between farm and reclaimed land

Shellfish Data: Cadmium

1. Step: Check if variances are equal

[Figure: Test for Equal Variances for Cadmium (95% Bonferroni confidence intervals for StDevs, and boxplots of Cadmium, by SpeciesGroup M and O)]

F-Test: Test Statistic = 0.33, P-Value = 0.000
Levene's Test: Test Statistic = 24.41, P-Value = 0.000

2. Step: Two-sample t-test with unequal variances

Two-sample T for Cadmium

SpeciesGroup   N    Mean   StDev  SE Mean
M             89  0.1656  0.0870   0.0092
O             79   0.385   0.151    0.017

Difference = mu (M) - mu (O)
Estimate for difference: -0.219192
95% CI for difference: (-0.257466, -0.180918)
T-Test of difference = 0 (vs not =): T-Value = -11.34  P-Value = 0.000  DF = 121

Effect of log-transformation: equal variances!

[Figure: Test for Equal Variances for log cadmium (95% Bonferroni confidence intervals for StDevs, and boxplots of log cadmium, by SpeciesGroup M and O)]

F-Test: Test Statistic = 1.10, P-Value = 0.660
Levene's Test: Test Statistic = 0.29, P-Value = 0.591

Non-parametric Tests
• t-tests make certain assumptions about the data
• For comparing means, normality is not so important
  – the central limit theorem says that means tend to be normally distributed for "large samples"
• Skewness, multiple modes and outliers may result in poor estimates of means and standard deviations
• Non-parametric tests simply use the rank order of the data and make fewer assumptions
• Give tests based on medians
• But still make some assumptions!!!

Correlation
• Gives a measure of linear association between two variables
• Values between -1 and +1
• 0: no linear association
• Correlation does not imply a causal relationship

Can be a useful summary, but grossly overused. Plots are often more informative, or regression modelling more useful.

Mussel mass data

Correlations: edible, height, width, length, shell

         edible  height  width  length
height    0.881
width     0.910   0.922
length    0.879   0.946  0.941
shell     0.932   0.881  0.934   0.899

Large correlations between all variables
• could use a formal hypothesis test that the correlation is non-zero, but not much point!
• Pictures tell more, but not everything ...!

[Figure: Matrix Plot of edible, height, width, length, shell]
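
The t-tests from the earlier slides can be checked outside Minitab. The following is a minimal sketch, assuming Python with NumPy and SciPy is available (the slides themselves only show Minitab output). The elastic band values are the ones listed on the slide. For the soil and shellfish examples only the rounded summary statistics from the slides are used, so those results may differ from the Minitab output in the last digit. All variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Elastic band data as listed on the slide (9 bands, each measured twice).
heated = np.array([244, 255, 253, 254, 251, 269, 248, 252, 292])
ambient = np.array([225, 247, 249, 253, 245, 259, 242, 255, 286])

# Paired t-test on the two measurements of each band ...
paired = stats.ttest_rel(heated, ambient)
# ... which is the same as a one-sample t-test of the differences against 0.
one_sample = stats.ttest_1samp(heated - ambient, popmean=0.0)
print("paired:     t = %.2f, p = %.3f" % (paired.statistic, paired.pvalue))
print("one-sample: t = %.2f, p = %.3f" % (one_sample.statistic, one_sample.pvalue))

# Soil contamination: pooled-variance two-sample t-test, computed from the
# rounded summary statistics on the slide (mean, StDev, N for each group).
pooled = stats.ttest_ind_from_stats(
    mean1=38.33, std1=2.92, nobs1=9,    # farm
    mean2=35.09, std2=3.21, nobs2=11,   # reclaimed
    equal_var=True,
)
print("pooled:     t = %.2f, p = %.3f" % (pooled.statistic, pooled.pvalue))

# Shellfish cadmium: unequal-variance (Welch) two-sample t-test, again from
# the rounded summary statistics on the slide.
welch = stats.ttest_ind_from_stats(
    mean1=0.1656, std1=0.0870, nobs1=89,  # SpeciesGroup M
    mean2=0.385, std2=0.151, nobs2=79,    # SpeciesGroup O
    equal_var=False,
)
print("Welch:      t = %.2f, p = %.3g" % (welch.statistic, welch.pvalue))
```

With the raw data available, the variance check in step 1 could be done along the same lines with `scipy.stats.levene`; it is not shown here because the slides do not list the individual soil or shellfish measurements.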