Basics of Statistics

Student’s t statistic
 Use
 Test for equality of two means
 E.g., compare two groups of subjects given
different treatments
 Test for value of a single mean
 E.g., test to see if a single group of subjects
differs from a known value
 Also ‘matched sample’ test where a single
group is compared before and after treatment
(test for zero treatment effect)
 Advanced
 Tests of significance of correlation/regression
coefficients.
Student’s t statistic
 Assumptions
 Parent population is normal
 Sample observations (subjects) are
independent.
 Robustness
 To normality: Departures from normality affect the
Type I error rate and power and may lead to
inappropriate interpretation. Real data are never
exactly normal, but they should not be strongly
skewed.
Student’s t statistic
 Formula (single group)
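Let x1, x2, …, xn be a random sample from a normal
population with mean µ and variance σ², then the
following statistic is distributed as Student’s t with
(n-1) degrees of freedom.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
where $\bar{x}$ is the sample mean and s is the sample standard deviation.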
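The single-group statistic can be computed directly from its definition. The following is a minimal Python sketch using only the standard library; the function name `one_sample_t` and the sample values are illustrative assumptions, not course data. Turning t into a p-value still requires a t table or statistical software.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1


# Hypothetical heights in inches, tested against a mean of 50
t, df = one_sample_t([51, 49, 52, 50, 53], mu0=50)
print(f"t = {t:.4f}, d.f. = {df}")
```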
Student’s t statistic
 Formula (two groups)
 Case 1: Two matched samples
 The following statistic follows a t distribution with n-1 d.f.
$$t = \frac{\bar{d}}{s_d/\sqrt{n}}$$
where d is the difference within each matched pair, $\bar{d}$ is the mean of
the n differences, and $s_d$ is the standard deviation of the variable d.
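As a rough illustration of the matched-samples case, the sketch below reduces the two samples to their pairwise differences and then applies the same arithmetic as a one-sample test against zero. The function name `paired_t` and the before/after scores are hypothetical.

```python
import math
import statistics


def paired_t(first, second):
    """Return (t, d.f.) for H0: the mean of the paired differences is zero."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    d_bar = statistics.fmean(diffs)
    s_d = statistics.stdev(diffs)  # standard deviation of the differences
    return d_bar / (s_d / math.sqrt(n)), n - 1


# Hypothetical scores for five subjects, after and before treatment
after = [11, 14, 13, 12, 15]
before = [10, 12, 10, 10, 13]
t, df = paired_t(after, before)
print(f"t = {t:.4f}, d.f. = {df}")
```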
Student’s t statistic

Formula (two groups)

Case 2: Equal Population Standard Deviations:

The following statistic is distributed as t distribution with (n1+n2 -2)
d.f.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
The pooled standard deviation,
$$S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$
n1 and n2 are the sample sizes and S1 and S2 are the sample
standard deviations of two groups.
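A small sketch of the equal-variance case may help: it computes the pooled standard deviation first and then the t statistic. The function name `pooled_t` and the two groups are assumptions made up for illustration.

```python
import math
import statistics


def pooled_t(group1, group2):
    """Return (t, d.f.) for two independent groups assuming equal variances."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    # Pooled standard deviation: weighted average of the two sample variances
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    diff = statistics.fmean(group1) - statistics.fmean(group2)
    t = diff / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical measurements for two treatment groups
t, df = pooled_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"t = {t:.4f}, d.f. = {df}")
```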
Student’s t statistic
 Formula (two groups)
 Case 3: Unequal population standard deviations
 The following statistic follows a t distribution.
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
 The d.f. of this statistic is,
$$v = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$
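The unequal-variance statistic and its degrees of freedom can be sketched as follows. The function name `welch_t`, the `mu_diff` parameter (the hypothesized value of µ1 − µ2, zero by default), and the data are illustrative assumptions. Note that the resulting d.f. is generally not an integer.

```python
import math
import statistics


def welch_t(group1, group2, mu_diff=0.0):
    """Return (t, v) for two independent groups with unequal variances."""
    n1, n2 = len(group1), len(group2)
    a = statistics.variance(group1) / n1  # s1^2 / n1
    b = statistics.variance(group2) / n2  # s2^2 / n2
    diff = statistics.fmean(group1) - statistics.fmean(group2)
    t = (diff - mu_diff) / math.sqrt(a + b)
    v = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, v


# Hypothetical measurements for two treatment groups
t, v = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"t = {t:.4f}, d.f. = {v:.3f}")
```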
Student’s t statistic
 One-sided
 There can only be one direction of effect
 The investigator is only interested in one
direction of effect.
 Greater power to detect difference in
expected direction
 Two-sided
 Difference could go in either direction
 More conservative
Student’s t statistic
            | One group                           | Two groups
One sided   | A single mean differs from a known  | Two means differ from one another
            | value in a specific direction,      | in a specific direction,
            | e.g., mean > 0                      | e.g., mean2 < mean1
Two sided   | A single mean differs from a known  | Two means are not equal,
            | value in either direction,          | that is, mean1 ≠ mean2
            | e.g., mean ≠ 0                      |
Student’s t statistic
 SPSS
 One Group: Analyze>Compare Means> One-Sample T Test
 Two Groups (Matched Samples):
Analyze>Compare Means> Paired Samples T
Test
 Two Groups: Analyze>Compare Means>
Independent Samples T Test
Student’s t statistic
 R
 The default t-test is
t.test(x, y = NULL, alternative = "two.sided", mu = 0,
paired = FALSE, var.equal = FALSE, conf.level = 0.95)
where x and y are numeric vectors of data (y is omitted for a one-group test).
We only need to change the default settings that differ for
the case we want to perform. For example,
 One Group: t.test(x, alternative = "greater", mu = 30)
 Two Groups (Matched Samples): t.test(x, y,
alternative = "less", mu = 0, paired = TRUE)
 Two Groups: t.test(x, y, alternative = "greater", mu = 0,
var.equal = TRUE)
Student’s t-statistic
 MS Excel (in Tools -> Data Analysis…)
 One Group: Not available
 Two Groups (Matched Samples):
 t-Test: Paired Two Sample for Means
 Two Groups (Independent Samples):
 t-Test: Two-Sample Assuming Equal Variances
 t-Test: Two-Sample Assuming Unequal Variances
Example 1
Consider the heights of children 4 to 12 years
old in dataset 1 of our course website
(variable ‘hgt’). Suppose we want to test if
the average height (µ) for this age group in
the population is 50 inches, using our sample
of 60 children. We will use a 5% level of
significance.
This is a one-sample, two-sided test.
Example 1
 Hypotheses:
H0: µ = 50
Ha: µ ≠ 50
 Computation in Excel:
 Excel does not have a 1-sample test, but we can
fool it.
 Create a dummy column parallel to the hgt column
with an equal number of cells, all set to 0.0
 Run the Matched sample test using hgt and the
dummy column and 50 as the hypothesized mean
difference.
 The p-value for two tail test is 0.0092
Example 1
 Using SPSS:
 Analyze> Compare Means> One-Sample T
Test> Select hgt> Test value: 50> OK
 P-value is .009
 Using R,
 t.test(df1$hgt, mu=50)
 Two-tail p-value is .0092
Example 2
 Suppose we want to compare the height of two
groups (hgt in each sex from dataset).
 H0: Mean heights are equal for the two sexes.
 Ha: Mean heights are not equal
 Using MS-Excel:
 Sort data by sex (Data > Sort > by: sex)
 In Data Analysis…, choose t-Test: Two-Sample Assuming Equal Variances
 Select the range of hgt for all sex = f as Variable 1 Range
 Select the range of hgt for all sex = m as Variable 2 Range
 P-value for two-sided test = 0.205
Example 2
 Using SPSS:
 Analyze>Compare Means>Independent-Samples T Test>
 Select hgt as a Test Variable
 Select sex as a Grouping Variable
 In Define Groups, type f for Group 1 and m for Group
2
 Click Continue then OK
 It gives us the p-value 0.205. We can assume equal
variance as the p-value of F statistic for testing
equality of variances is 0.845.
Sign Test (Nonparametric)
 Use:
(1) Compare the median of a single group with a
specified value (instead of single sample t-test).
(2) Compare medians of two matched groups
(instead of Two matched samples t-test)
 Test Statistic:
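Number of positive differences (xi − c), where c is the
specified median (for matched groups, the differences
within pairs are used). Under H0, the number of positive
differences follows a Binomial distribution with
success probability 0.5.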
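The counting argument behind the sign test can be sketched in a few lines of Python. The function name `sign_test` and the sample are hypothetical. Dropping observations equal to c and doubling the smaller tail are common conventions assumed here, not something the slides specify.

```python
from math import comb


def sign_test(sample, c):
    """Two-sided exact sign test of H0: median == c."""
    pos = sum(1 for x in sample if x > c)
    neg = sum(1 for x in sample if x < c)
    n = pos + neg  # observations equal to c carry no sign and are dropped
    # Under H0 the count of positive signs is Binomial(n, 0.5)
    upper = sum(comb(n, k) for k in range(pos, n + 1)) / 2 ** n  # P(X >= pos)
    lower = sum(comb(n, k) for k in range(0, pos + 1)) / 2 ** n  # P(X <= pos)
    p_value = min(1.0, 2 * min(lower, upper))
    return pos, n, p_value


# Hypothetical sample tested against a median of 50
pos, n, p = sign_test([52, 48, 55, 60, 47, 58, 61, 53], c=50)
print(f"positive signs = {pos} of {n}, two-sided p = {p:.4f}")
```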
Sign Test (Nonparametric)
 SPSS: Analyze> Nonparametric Tests>
Binomial
 R: SIGN.test(x, y = NULL, md = 0,
alternative = "two.sided", conf.level =
0.95)
 For testing the median (md) of a single
sample, use data only for one variable.
 To compare paired data, use two paired
variables.
 NB: This test requires the BSDA package
Wilcoxon Signed-Rank Test:
 USE:
 Compares medians of two paired samples.
 Test Statistic:
Consider n pairs of observations on two variables X
and Y. The Wilcoxon signed-rank statistic is
WS = sum of the ranks of the positive
differences, where ranks are assigned to the
absolute values of the differences.
Wilcoxon Rank-Sum Test
 Use: Compares medians of two
independent groups.
 Test Statistic:
Let X and Y be two samples of sizes m and
n, and let N = m + n. Rank all N observations
together. The test statistic is
Wm = sum of the ranks of all observations of
variable X.
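The rank-sum statistic can be sketched as below: pool the two samples, rank them together (averaging tied ranks), and add up the ranks that belong to X. The helper names `average_ranks` and `rank_sum` and the data are illustrative assumptions.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum(x, y):
    """Sum of the ranks of x within the combined sample of x and y."""
    ranks = average_ranks(list(x) + list(y))
    return sum(ranks[: len(x)])


# Hypothetical samples of sizes m = 3 and n = 4
w_m = rank_sum([1.1, 2.3, 4.0], [0.5, 3.2, 5.1, 6.0])
print(f"Wm = {w_m}")
```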
Wilcoxon Signed-Rank Test &
Wilcoxon Rank-Sum Test
 SPSS:
 Two Matched Groups: Analyze>
Nonparametric Tests> 2 Related Samples
 Two Groups: Analyze> Nonparametric
Tests> 2 Independent Samples
Wilcoxon Signed-Rank Test &
Wilcoxon Rank-Sum Test
 R:
 The default test is
wilcox.test(x, y = NULL, alternative = "two.sided", mu
= 0, paired = FALSE, exact = NULL, conf.int =
FALSE, conf.level = 0.95)
 Two matched Groups: wilcox.test(x, y, alternative =
"less", paired = TRUE)
 Two Groups: wilcox.test(x, y, alternative =
"greater")
Example 3 (two matched samples)
Subject | Hours of sleep (Drug) | Hours of sleep (Placebo) | Difference | Rank ignoring sign
1       | 6.1                   | 5.2                      |  0.9       | 3.5
2       | 7.0                   | 7.9                      | -0.9       | 3.5
3       | 8.2                   | 3.9                      |  4.3       | 10
4       | 7.6                   | 4.7                      |  2.9       | 7
5       | 6.5                   | 5.3                      |  1.2       | 5
6       | 8.4                   | 5.4                      |  3.0       | 8
7       | 6.9                   | 4.2                      |  2.7       | 6
8       | 6.7                   | 6.1                      |  0.6       | 2
9       | 7.4                   | 3.8                      |  3.6       | 9
10      | 5.8                   | 6.3                      | -0.5       | 1
The 3rd and 4th ranks are tied and hence averaged.
The p-value of this test is 0.02. Hence the test is significant at any level above
2%, indicating that the drug is more effective than the placebo.
Proportion Tests
 Use
 Test for equality of two Proportions
 E.g. proportions of subjects in two treatment
groups who benefited from treatment.
 Test for the value of a single proportion
 E.g., to test if the proportion of smokers in a
population is some specified value (less than 1)
Proportion Tests
 Formula
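 One Group:
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$
 Two Groups:
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$.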
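Both z statistics are simple enough to compute by hand or in a few lines of code. In the sketch below the function names (`one_prop_z`, `two_prop_z`, `two_sided_p`) and the counts are hypothetical, and the p-value relies on the usual large-sample normal approximation.

```python
import math
from statistics import NormalDist


def one_prop_z(x, n, p0):
    """z statistic for H0: population proportion == p0."""
    p_hat = x / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def two_prop_z(x1, n1, x2, n2):
    """z statistic for H0: the two population proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled estimate under H0
    return (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))


def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 40 of 100 successes tested against p0 = 0.5
z1 = one_prop_z(40, 100, 0.5)
# Hypothetical counts: 30 of 100 versus 45 of 100 successes
z2 = two_prop_z(30, 100, 45, 100)
print(f"one group: z = {z1:.3f}, p = {two_sided_p(z1):.4f}")
print(f"two groups: z = {z2:.3f}, p = {two_sided_p(z2):.4f}")
```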
Proportion Test
 SPSS:
 One Group: Analyze> Nonparametric Tests> Binomial
 Two Groups?
 R:
 The default tests are:
 One Group: binom.test(x, n, p = 0.5, alternative =
"two.sided", conf.level = 0.95)
 Two Groups: prop.test(c(x,y), c(m,n), p = NULL,
alternative = "two.sided", conf.level = 0.95, correct
= TRUE)
 x and y are the numbers of successes, and m and n
are the sample sizes
Example 4: Proportion of males in
Dataset 1
 R:
 n=60 and there are 30 males
 binom.test(30,60) returns a p-value of 1.0.
 SPSS:
 Recode sex as numeric:
 Transform> Recode> Into Different Variables> make
all selections there and click on Change after
recoding the character variable into numeric.
 Analyze> Nonparametric test> Binomial> select Test
variable> Test proportion
 Set null hypothesis = 0.5
 The p-value = 1.0
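The p-value of 1.0 follows from the fact that 30 successes out of 60 is the most likely outcome when the true proportion is 0.5. The sketch below checks this with an exact binomial calculation. The function name `binom_two_sided_p` is hypothetical, and the two-sided rule used (sum the probabilities of all outcomes no more likely than the observed one) is one common definition, assumed here.

```python
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    # Sum the probabilities of outcomes no more likely than the observed one
    return min(1.0, sum(q for q in pmf if q <= cutoff))


# 30 males among the 60 children in Example 4
p_value = binom_two_sided_p(30, 60)
print(f"two-sided p = {p_value:.4f}")
```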
Chi-square statistic
 USE
 Testing the population variance σ² = σ0².
 Testing the goodness of fit.
 Testing the independence/ association of attributes
 Assumptions
 Sample observations should be independent.
 Expected cell frequencies should be >= 5.
 Total observed and expected frequencies are
equal
Chi-square statistic
 Formula: If xi (i=1,2,…,n) are independent
and normally distributed with mean µ and
standard deviation σ, then,
$$\sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^2 \text{ is a } \chi^2 \text{ distribution with } n \text{ d.f.}$$
 If we don’t know µ, then we estimate it using
a sample mean and then,
$$\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma}\right)^2 \text{ is a } \chi^2 \text{ distribution with } (n - 1) \text{ d.f.}$$
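For the variance test, the second formula is evaluated with σ set to the hypothesized value σ0. A minimal sketch, in which the function name `chi_square_variance`, the data, and the hypothesized variance are all illustrative assumptions:

```python
import statistics


def chi_square_variance(sample, sigma0_sq):
    """Return (statistic, d.f.) for H0: population variance == sigma0_sq."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    # Sum of squared deviations from the sample mean, scaled by sigma0^2
    stat = sum((x - x_bar) ** 2 for x in sample) / sigma0_sq
    return stat, n - 1


# Hypothetical sample tested against a variance of 2.0
stat, df = chi_square_variance([51, 49, 52, 50, 53], sigma0_sq=2.0)
print(f"chi-square = {stat:.3f}, d.f. = {df}")
```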
Chi-square statistic
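 For a contingency table we use the following
chi-square test statistic,
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
where the sum runs over the n cells,
Oi = Observed Frequency
Ei = Expected Frequency
The statistic is distributed as χ² with (r - 1)(c - 1) d.f. for a table
with r rows and c columns (and with (n - 1) d.f. when it is used
for goodness of fit with n categories).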
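To make the contingency-table calculation concrete, the sketch below derives each expected frequency as (row total × column total) / grand total and sums the contributions. The function name `chi_square_independence` and the 2 × 3 table are hypothetical; the d.f. are counted as (rows − 1)(columns − 1), the standard count for a test of independence.

```python
def chi_square_independence(table):
    """Return (Pearson chi-square statistic, d.f.) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical 2 x 3 table of counts
stat, df = chi_square_independence([[10, 20, 30], [20, 20, 20]])
print(f"chi-square = {stat:.3f}, d.f. = {df}")
```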
Chi-square statistic
 SPSS:
 Analyze> Descriptive Statistics> Crosstabs>
Statistics> Chi-square
 Select variables.
 Click on Cell button to select items you
want in cells, rows, and columns.
Example 5 (class demonstration)
 Make a contingency table using two variables
sex and grp from our dataset.
 Analyze> Descriptive statistics> crosstabs>
 select variables for rows and columns
 Statistics> Chi-square> Continue> Cells>
selection> ok.
 It will give us a contingency table and p-value
of Pearson Chi-square Tests.
 For this particular case, the p-value of the Pearson Chi-square test is 0.549 and d.f. is 2.
F-statistic
 Use:
 Testing the equality of population
variances.
 Testing the significance of difference of
several means in analysis of variance.
F-statistic
 Let X and Y be two independent Chi-square variables with
n1 and n2 d.f. respectively, then the following statistic
follows an F distribution with n1 and n2 d.f.
$$F_{n_1, n_2} = \frac{X/n_1}{Y/n_2}$$
 Let X and Y be two independent normal variables with equal
population variances and sample sizes n1 and n2. Then the
following statistic follows an F distribution with (n1 - 1) and (n2 - 1) d.f.
$$F_{n_1 - 1,\, n_2 - 1} = \frac{s_x^2}{s_y^2}$$
where sx² and sy² are the sample variances of X and Y.
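The variance-ratio statistic is a one-line calculation once the sample variances are available. In this sketch the function name `variance_ratio_f` and the data are hypothetical, and the d.f. are taken as one less than each sample size, which is the standard count for sample variances.

```python
import statistics


def variance_ratio_f(x, y):
    """Return (F, numerator d.f., denominator d.f.) for H0: equal variances."""
    f = statistics.variance(x) / statistics.variance(y)
    return f, len(x) - 1, len(y) - 1


# Hypothetical samples from two groups
f, df1, df2 = variance_ratio_f([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"F = {f:.3f} with ({df1}, {df2}) d.f.")
```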
F-statistic
 Hypotheses:
H0: µ1 = µ2 = … = µn
Ha: not all of µ1, µ2, …, µn are equal
Comparison will be done using analysis of
variance (ANOVA) technique. ANOVA uses F
statistic for this comparison. The ANOVA
technique will be covered in another class
session.