T-Tests and ANOVAs

Jennifer Siegel
Statistical background
Z-Test
T-Test
ANOVAs
 Science tries to predict the future
 Genuine effect?
 Attempt to strengthen predictions with stats
 Use P-Value to indicate our level of certainty that result =
genuine effect on whole population (more on this later…)
 Develop an experimental hypothesis
 H0 = null hypothesis
 H1 = alternative hypothesis
 Statistically significant result
 P-value threshold (alpha level) = .05 or 5%
 The p-value is the probability of obtaining the observed result if the null hypothesis is true
 If p < .05 we are (in this framework) 95% confident that our experimental effect is genuine
 Type 1 error = false positive
 Type 2 error = false negative
 Confidence = 1 – probability of a Type 1 error
 Let’s pretend you came up with the following theory…
Having a baby increases brain volume (associated with
possible structural changes)
Z-test vs. T-test
 Z-test: uses data from the whole population (the population mean and standard deviation are known)
 Cost
 Not able to include everyone
 Too time consuming
 Ethical right to privacy
Realistically, researchers can only do sample-based studies
 T = differences between sample means / standard error
of sample means
 Degrees of freedom = sample size - 1
t = differences between sample means / estimated standard error of the differences between means

t = (x̄1 − x̄2) / s(x̄1−x̄2)

where s(x̄1−x̄2)² = s1²/n1 + s2²/n2
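The same calculation can be sketched in Python (not part of the original slides; the sample values below are made up purely for illustration, and numpy/scipy are assumed to be available):

```python
import numpy as np
from scipy import stats

# Two hypothetical independent samples (made-up values for illustration)
group1 = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0])
group2 = np.array([4.2, 4.8, 5.0, 4.5, 4.9, 4.4])

# t = difference between sample means / estimated standard error of that difference
se_diff = np.sqrt(group1.var(ddof=1) / len(group1) + group2.var(ddof=1) / len(group2))
t_manual = (group1.mean() - group2.mean()) / se_diff

# scipy's Welch t-test uses the same standard-error term (equal_var=False)
t_scipy, p = stats.ttest_ind(group1, group2, equal_var=False)
print(t_manual, t_scipy, p)
```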
 H0 = There is no difference in brain size before vs. after giving birth
 H1 = The brain is significantly smaller or significantly larger after giving birth (a difference is detected)
Brain volumes for the 8 women, before and 6 weeks after delivery:

        Before Delivery    6 Weeks After Delivery    Difference
        1437.4             1494.5                     57.1
        1089.2             1109.7                     20.5
        1201.7             1245.4                     43.7
        1371.8             1383.6                     11.8
        1207.9             1237.7                     29.8
        1150.7             1180.1                     29.4
        1221.9             1268.8                     46.9
        1208.7             1248.3                     39.6
Sum     9889.3             10168.1                   278.8
Mean    1236.1625          1271.0125                  34.85
SD      113.8544928        119.0413426                 5.18685
t = mean of the differences / SD of the differences = 34.85 / 5.18685

t = 6.718914454, df = 8 − 1 = 7
Women have a significantly larger brain after giving birth
http://www.danielsoper.com/statcalc/calc08.aspx
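For readers who want to reproduce this, here is a minimal sketch of the repeated-measures (paired) t-test on the table above, assuming Python with numpy and scipy is available (not part of the original slides). The result should agree closely with the value reported above; small differences can arise from how the standard error of the differences is computed:

```python
import numpy as np
from scipy import stats

# Brain volumes from the table above
before = np.array([1437.4, 1089.2, 1201.7, 1371.8, 1207.9, 1150.7, 1221.9, 1208.7])
after  = np.array([1494.5, 1109.7, 1245.4, 1383.6, 1237.7, 1180.1, 1268.8, 1248.3])

# Repeated-measures (paired) t-test: the same women measured before and after delivery
t, p = stats.ttest_rel(after, before)
print(f"t = {t:.2f}, df = {len(before) - 1}, p = {p:.4f}")
```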
Types of t-test:
 One-sample (sample vs. hypothesized mean)
 Independent groups (2 separate groups)
 Repeated measures (same group, measured on different occasions)
 ANalysis Of VAriance
 Factor = what is being compared (type of pregnancy)
 Levels = different elements of a factor (age of mother)
 F-Statistic
 Post hoc testing
 One-way ANOVA
 1 factor with more than 2 levels
 Factorial ANOVA
 More than 1 factor
 Mixed-design ANOVAs
 Some factors are independent, others are related
 A significant F-statistic tells us there is a significant difference somewhere between the groups
 NOT where the difference lies
 Finding exactly where the difference lies requires further statistical analysis = post hoc analysis
 Z-Tests for populations
 T-Tests for samples
 ANOVAs compare more than 2 groups in more complicated scenarios
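A minimal one-way ANOVA sketch in Python (not from the slides; the three levels below are made-up data, and scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

# One factor with three levels (made-up data for illustration)
level_a = np.array([5.2, 4.8, 6.1, 5.5, 5.9])
level_b = np.array([6.8, 7.1, 6.5, 7.4, 6.9])
level_c = np.array([5.0, 5.3, 4.7, 5.6, 5.1])

# One-way ANOVA: is there a difference somewhere between the groups?
f, p = stats.f_oneway(level_a, level_b, level_c)
print(f"F = {f:.2f}, p = {p:.4f}")

# A significant F does NOT say which groups differ; that requires post hoc tests
# (e.g. pairwise t-tests with a multiple-comparison correction).
```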
Varun V. Sethi
Objective
Correlation
Linear Regression
Take Home Points.
Correlation
- How linear is the relationship between two variables? (descriptive)
Regression
- How well does a linear model explain my data? (inferential)
Correlation
[Figure: example scatterplots] Correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle row), nor many aspects of nonlinear relationships (bottom row).
 Strength and direction of the relationship between
variables
 Scattergrams
[Scattergram examples: positive correlation, negative correlation, no correlation]
Measures of Correlation
1) Covariance
2) Pearson Correlation Coefficient (r)
1) Covariance
- The covariance is a statistic representing the degree to which 2
variables vary together
cov(x, y) = Σ(i=1..n) (xi − x̄)(yi − ȳ) / n

{Note that sx² = cov(x, x)}
 A statistic representing the degree to which 2 variables
vary together
 Covariance formula:
cov(x, y) = Σ(i=1..n) (xi − x̄)(yi − ȳ) / n
 cf. variance formula:
sx² = Σ(i=1..n) (xi − x̄)² / n
2) Pearson correlation coefficient (r)
r_xy = cov(x, y) / (sx · sy)
(s = standard deviation of the sample)
- r is a kind of ‘normalised’ (dimensionless) covariance
- r takes values from −1 (perfect negative correlation) to 1 (perfect positive correlation); r = 0 means no correlation
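As a quick illustration (not part of the slides), the covariance and Pearson r formulas above can be computed directly in Python with numpy/scipy; the x and y values below are made up:

```python
import numpy as np
from scipy import stats

# Two hypothetical variables (made-up values for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

# Covariance: degree to which x and y vary together
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson r: covariance normalised by the two standard deviations
r_manual = cov_xy / (x.std() * y.std())
r_scipy, p = stats.pearsonr(x, y)
print(r_manual, r_scipy, p)
```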
Limitations:
- Sensitive to extreme values (outliers)
- Describes a relationship, not a prediction
- Correlation is not causality
Regression: Prediction of one variable from knowledge of one or
more other variables
How well does a linear model (y = ax + b) explain the relationship between two variables?
- If there is such a relationship, we can ‘predict’ the value of y for a given x
(figure example: for x = 25, the predicted y is 7.498)
Linear dependence between 2 variables
Two variables are linearly dependent when the increase of one variable
is proportional to the increase of the other one
Examples:
- Energy needed to boil water
- Money needed to buy coffeepots
Fitting a straight line to the data (or vice versa):
Here, ŷ = ax + b
– ŷ : predicted value of y
– a: slope of regression line
– b: intercept
[Figure: regression line ŷ = ax + b fitted to the data; for each point, ŷi = predicted value, yi = observed value, εi = residual]
Residual error (εi): Difference between obtained and predicted values of y (i.e. yi − ŷi)
Best-fit line (values of a and b) is the one that minimises the sum of squared errors, SSerror = Σ(yi − ŷi)²
Adjusting the straight line to the data:
• Minimise Σ(yi − ŷi)², which is Σ(yi − axi − b)²
• The minimum SSerror is at the bottom of the curve, where the gradient is zero
– and this can be found with calculus
• Take partial derivatives of Σ(yi − axi − b)² with respect to the parameters a and b and solve for 0 as simultaneous equations, giving (see the sketch below):
a = r·sy / sx
b = ȳ − a·x̄
• This can always be done
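A short Python sketch (not from the slides; made-up data) showing that the formulas a = r·sy/sx and b = ȳ − a·x̄ give the same line as a generic least-squares fit:

```python
import numpy as np

# Hypothetical data (made-up values for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.9, 5.8, 7.0])

# Slope and intercept from the formulas above: a = r*sy/sx, b = ybar - a*xbar
r = np.corrcoef(x, y)[0, 1]
a = r * y.std() / x.std()
b = y.mean() - a * x.mean()

# The same line obtained from a generic least-squares fit
a_fit, b_fit = np.polyfit(x, y, 1)
print(a, b, a_fit, b_fit)
```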
 We can calculate the regression line for any data, but how well does it fit the
data?
 Total variance = predicted variance + error variance
sy2 = sŷ2 + ser2
 Also, it can be shown that r2 is the proportion of the variance in y that is
explained by our regression model
r2 = sŷ2 / sy2
 Insert sŷ2 = r2 sy2 into sy2 = sŷ2 + ser2 and rearrange to get:
ser2 = sy2 (1 – r2)
From this we can see that the greater the correlation
the smaller the error variance, so the better our
prediction
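Continuing the same hypothetical example, a few lines of Python (not from the slides) can verify the variance partition sy² = sŷ² + ser² and that r² = sŷ² / sy²:

```python
import numpy as np

# Continuing the hypothetical example above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.9, 5.8, 7.0])

a, b = np.polyfit(x, y, 1)
y_hat = a * x + b                      # predicted values
err = y - y_hat                        # residuals

r = np.corrcoef(x, y)[0, 1]
print(np.var(y), np.var(y_hat) + np.var(err))   # total = predicted + error variance
print(r**2, np.var(y_hat) / np.var(y))          # r^2 = proportion of variance explained
```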
 Do we get a significantly better prediction of y
from our regression equation than by just
predicting the mean?
F-statistic
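For simple linear regression this F can be written directly in terms of r² (a standard result, not shown on the slides): F = r²(n − 2) / (1 − r²), with 1 and n − 2 degrees of freedom; it is also the square of the t-statistic for the slope.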
 Prediction / Forecasting
 Quantify the strength of the relationship between y and the Xj (X1, X2, X3)
 A General Linear Model is just any model that
describes the data in terms of a straight line
 Linear regression is actually a form of the General
Linear Model where the parameters are b, the slope of
the line, and a, the intercept.
y = bx + a + ε
 Multiple regression is used to determine the effect of a
number of independent variables, x1, x2, x3 etc., on a single
dependent variable, y
 The different x variables are combined in a linear way and
each has its own regression coefficient:
y = b0 + b1x1 + b2x2 + … + bnxn + ε
 The b parameters (regression coefficients) reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y.
 i.e. the amount of variance in y that is accounted for by each x
variable after all the other x variables have been accounted for
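A minimal multiple-regression sketch in Python (not from the slides; the data are simulated so the recovered coefficients can be checked against known values):

```python
import numpy as np

# Hypothetical data: two independent variables and one dependent variable
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.5 + 2.0 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=50)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of b0, b1, b2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # should be close to [1.5, 2.0, -0.7]
```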
Take Home Points
- Correlation doesn’t mean the variables are genuinely related.
e.g. any two variables increasing or decreasing over time will show a nice correlation: CO2 concentration in the air over Antarctica and lodging rental costs in London. Beware in longitudinal studies!
- A relationship between two variables doesn’t mean causality
(e.g. leaves on the forest floor and hours of sun)
 Linear regression is a GLM that models the effect of one
independent variable, x, on one dependent variable, y
 Multiple Regression models the effect of several
independent variables, x1, x2 etc, on one dependent
variable, y
 Both are types of General Linear Model
Thank You