Statistical hypothesis testing – Inferential statistics II.

Testing for associations
Three main topics
• Chi-square test of association
• Correlation analysis
• Linear regression analysis
Introduction
• Association: a general term used to describe the
relationship between two variables.
• If two variables are associated, the value of one
variable can be predicted, at least roughly,
provided we know the value of the other variable.
In short, the two variables are NOT independent
of each other in a statistical sense.
• E.g.:
– Colour of hair and eyes:
if someone’s hair is brown, there is a high likelihood
that their eyes are brown too.
– Length and weight of fish:
the longer the fish, the greater its weight.
Chi-square test of association
• We use this test to examine the association
between two or more categorical (nominal or
factor) variables.
• Data should be arranged in a contingency table.
Observed frequencies of cases:

                        Variable 2
                  Category 1   Category 2
Variable 1
  Category 1          15            5
  Category 2           6           12
• A contingency-table test examines whether the
pattern of frequencies in one categorical variable
differs between the levels of the other categorical
variable:
could the variables be independent of one
another?
• H0: the observations are independent of one
another, that is the categorical variables are not
associated.
• Test statistic: χ²
• Null distribution: χ² distribution
with df = (number of rows - 1) × (number of columns - 1)
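A minimal sketch in Python of how such a test could be run, using
scipy.stats.chi2_contingency on the counts from the table above (the
row/column arrangement of the four counts is an assumption):

    from scipy.stats import chi2_contingency

    # Observed frequencies from the 2x2 contingency table above
    # (rows: categories of variable 1; columns: categories of variable 2)
    observed = [[15, 5],
                [6, 12]]

    # Note: for 2x2 tables, chi2_contingency applies Yates' continuity
    # correction by default (correction=True).
    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
    # df = (2 - 1) * (2 - 1) = 1; a small p-value leads us to reject H0,
    # i.e. the two categorical variables appear to be associated.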
Correlation analysis
• Correlation:
– It is a monotonic type of association:
the greater the value of one variable, the greater
(positive correlation) or the smaller (negative
correlation) the value of the other variable.
• Both variables must be measured on at least an
ordinal scale.
• There is no distinction between dependent and
independent variables => there is no attempt to
interpret the causality of the association.
• Two frequently used types of correlation:
– Pearson’s product-moment correlation
– Spearman’s rank correlation.
• Pearson’s product-moment correlation
– Correlation coefficient (r):
it measures the strength of the relationship between
the two variables.
[-1 ≤ r ≤ 1]
r = -1: perfect negative correlation
r = 1: perfect positive correlation
r = 0: no correlation
– H0: ρ = 0 (the population correlation is zero)
H1: ρ ≠ 0
– Assumptions:
• Both variables are measured on a continuous scale.
• Both variables are normally distributed.
If the assumptions are not met, Spearman’s rank
correlation should be used.
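A minimal sketch of the test in Python with scipy.stats.pearsonr; the
fish length/weight values are hypothetical illustration data:

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical fish data: length (cm) and weight (g)
    length = np.array([20, 24, 27, 31, 35, 38, 42, 45])
    weight = np.array([110, 150, 180, 240, 300, 350, 430, 480])

    r, p = pearsonr(length, weight)
    print(f"r = {r:.3f}, p = {p:.4f}")
    # H0: rho = 0; a small p-value indicates a significant
    # (here strongly positive) linear correlation.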
• Spearman’s rank correlation
– It is essentially the same as Pearson’s correlation,
but computed on the ranks of the data.
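A sketch of the rank-based version with scipy.stats.spearmanr, reusing
the same hypothetical fish data as in the Pearson sketch:

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical fish data: length (cm) and weight (g)
    length = np.array([20, 24, 27, 31, 35, 38, 42, 45])
    weight = np.array([110, 150, 180, 240, 300, 350, 430, 480])

    # spearmanr replaces each value with its rank before correlating
    rho, p = spearmanr(length, weight)
    print(f"rho = {rho:.3f}, p = {p:.4f}")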
Regression analysis
• We assume that there is a dependence structure
between the variables:
– dependent (response) variable (Y) – the effect
– independent (explanatory or predictor) variable (X) – the cause.
• Aim of the analysis: describe the relationship
between Y and X in a function form.
This function can be used for prediction.
• Simple linear regression:
there is only one X variable in the model:
Y = β0 + β1X
• Multiple linear regression:
there are two or more X variables in the model:
Y = β0 + β1X1 + β2X2 + … + βpXp
• Simple linear regression model:
Parameters of the model:
– β0: the value of y when x = 0
(the y-intercept)
– β1: the amount by which y changes
per unit change in x
(the gradient of the line, i.e. the regression slope)
– Example: the line y = 2 + 1.5x has intercept β0 = 2
and slope β1 = 1.5.
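A minimal sketch of fitting a simple linear regression in Python with
scipy.stats.linregress; the data are synthetic, generated around the
example line y = 2 + 1.5x:

    import numpy as np
    from scipy.stats import linregress

    rng = np.random.default_rng(42)
    x = np.linspace(0, 10, 30)
    # Synthetic observations scattered around y = 2 + 1.5x
    y = 2 + 1.5 * x + rng.normal(0, 1, size=x.size)

    fit = linregress(x, y)
    print(f"intercept (estimate of b0) = {fit.intercept:.2f}")  # near 2
    print(f"slope     (estimate of b1) = {fit.slope:.2f}")      # near 1.5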
• Hypothesis tests in simple linear regression:
– F-test: the overall test of the model
– t-test for zero intercept:
• H0: β0 = 0
• H1: β0 ≠ 0
– t-test for zero slope (in simple linear regression its
result is the same as that of the F-test):
• H0: β1 = 0
There is no linear relationship between X and Y.
• H1: β1 ≠ 0
There is a linear relationship between X and Y.
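These tests can be read off a fitted ordinary-least-squares model; a
sketch with statsmodels, refitting the same synthetic data as above so
the block runs on its own:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    x = np.linspace(0, 10, 30)
    y = 2 + 1.5 * x + rng.normal(0, 1, size=x.size)

    X = sm.add_constant(x)     # adds the column of ones for beta0
    model = sm.OLS(y, X).fit()

    print(model.params)        # estimates of beta0 and beta1
    print(model.pvalues)       # t-tests: H0: beta0 = 0 and H0: beta1 = 0
    print(model.f_pvalue)      # F-test of the model; in simple linear
                               # regression it agrees with the slope t-test
    print(model.rsquared)      # coefficient of determination (R2)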
• Coefficient of determination (R2):
– Gives the proportion of the variation in Y that is
accounted for by X.
• Residuals of the model (errors):
– The variation in the data left over after the linear regression
model has been fitted.
• Model validation process:
– After fitting a model to our data, we need to check whether the
assumptions of linear regression analysis are met.
– This can be done by examining the residuals of the fitted model
(see the sketch after the list of assumptions below).
• Assumptions of the linear regression model:
– Independence: the observations are independent of one another.
– Normality: it means that the populations of Y-values, and the error
terms (εi), are normally distributed for each value of the predictor
variable xi.
– Homogeneity of variance: it means that the populations of
Y-values, and the error terms (εi), have the same variance for each xi.
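A sketch of these residual checks in Python, refitting the synthetic
statsmodels model from the previous sketch: normality of the residuals
is examined with the Shapiro-Wilk test, and homogeneity of variance is
judged from a residuals-vs-fitted plot:

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    from scipy.stats import shapiro

    # Refit the same synthetic model as in the OLS sketch above
    rng = np.random.default_rng(42)
    x = np.linspace(0, 10, 30)
    y = 2 + 1.5 * x + rng.normal(0, 1, size=x.size)
    model = sm.OLS(y, sm.add_constant(x)).fit()

    residuals = model.resid
    w, p = shapiro(residuals)   # Shapiro-Wilk test of normality
    print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.4f}")
    # A large p-value gives no evidence against the normality assumption.

    # An even, patternless scatter of residuals around zero supports the
    # homogeneity-of-variance assumption.
    plt.scatter(model.fittedvalues, residuals)
    plt.axhline(0, linestyle="--", color="grey")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()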