Statistical hypothesis testing – Inferential statistics II. Testing for associations

Three main topics
• Chi-square test of association
• Correlation analysis
• Linear regression analysis

Introduction
• Association: a general term used to describe the relationship between two variables.
• If two variables are associated, the value of one variable can be guessed more or less accurately, provided we know the value of the other variable. In short, they are NOT independent of each other in a statistical sense.
• E.g.:
  – Colour of hair and eyes: if someone's hair is brown, there is a high likelihood that their eyes are brown too.
  – Length and weight of fish: the longer the fish, the greater its weight.

Chi-square test of association
• We use this test to examine the association between two or more categorical (nominal or factor) variables.
• Data should be arranged in a contingency table of observed frequencies of cases, e.g. for two variables with two categories each:

                        Variable 2, category 1   Variable 2, category 2
  Variable 1, category 1          15                        5
  Variable 1, category 2           6                       12

• Contingency tables test whether the pattern of frequencies in one categorical variable differs between the levels of the other categorical variable: could the variables be independent of one another?
• H0: the observations are independent of one another, that is, the categorical variables are not associated.
• Test statistic: χ²
• Null distribution: χ² distribution with df = (number of rows − 1) × (number of columns − 1). (A worked sketch follows the correlation section below.)

Correlation analysis
• Correlation:
  – It is a monotonic type of association: the greater the value of one variable, the greater (positive correlation) or the smaller (negative correlation) the value of the other variable.
• The scale of measurement of the two variables needs to be at least ordinal.
• There is no distinction between dependent and independent variables => there is no attempt to interpret the causality of the association.
• Two frequently used types of correlation:
  – Pearson's product-moment correlation
  – Spearman's rank correlation.
• Pearson's product-moment correlation
  – Correlation coefficient (r): it measures the strength of the relationship between two variables, with −1 ≤ r ≤ 1.
      r = −1: perfect negative correlation
      r = 1: perfect positive correlation
      r = 0: no correlation
  – H0: r = 0; H1: r ≠ 0
  – Assumptions:
      • Both variables are measured on a continuous scale.
      • Both variables are normally distributed.
    If the assumptions are not met, Spearman's rank correlation should be used.
• Spearman's rank correlation
  – It is essentially the same correlation as Pearson's, but computed on the ranks of the data.
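A minimal worked sketch of the chi-square test of association described above, using Python with SciPy (the library choice and the exact call are assumptions of this sketch; the observed frequencies are the ones from the example contingency table):

    # Chi-square test of association on a 2x2 contingency table (SciPy sketch).
    from scipy.stats import chi2_contingency

    # Observed frequencies: rows = categories of variable 1,
    # columns = categories of variable 2 (values from the example table above).
    observed = [[15, 5],
                [6, 12]]

    chi2, p_value, df, expected = chi2_contingency(observed)

    print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.4f}")
    print("Expected frequencies under H0:")
    print(expected)
    # df = (rows - 1) * (columns - 1) = 1. A small p-value means we reject H0
    # (independence), i.e. the two categorical variables appear to be associated.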
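A similar sketch for the two correlation coefficients, again using SciPy; the fish length and weight values are invented purely for illustration:

    # Pearson's and Spearman's correlation coefficients (SciPy sketch).
    from scipy.stats import pearsonr, spearmanr

    # Hypothetical length (cm) and weight (g) measurements of seven fish.
    length = [21.0, 24.5, 26.0, 28.5, 31.0, 33.5, 36.0]
    weight = [110, 150, 175, 210, 260, 305, 370]

    r, p_pearson = pearsonr(length, weight)      # assumes both variables are normal
    rho, p_spearman = spearmanr(length, weight)  # rank-based, no normality assumption

    print(f"Pearson  r   = {r:.3f} (p = {p_pearson:.4f})")
    print(f"Spearman rho = {rho:.3f} (p = {p_spearman:.4f})")
    # Both test H0: no correlation; small p-values suggest a positive monotonic
    # association between length and weight in this made-up sample.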
Regression analysis
• We assume that there is a dependence structure between the variables:
  – dependent (response) variable (Y) – the effect
  – independent (explanatory or predictor) variable (X) – the cause.
• Aim of the analysis: to describe the relationship between Y and X in the form of a function. This function can then be used for prediction.
• Simple linear regression: there is only one X variable in the model: Y = β0 + β1X1
• Multiple linear regression: there are two or more X variables in the model: Y = β0 + β1X1 + β2X2 + … + βpXp
• Simple linear regression model – parameters of the model:
  – β0: the value of Y when X = 0 (the y-intercept); e.g. in the line y = 2 + 1.5x, β0 = 2.
  – β1: the degree to which Y changes per unit change in X (the gradient of the line, i.e. the regression slope); in the same example, β1 = 1.5.
• Hypothesis tests in simple linear regression (a fitting sketch is given at the end of this section):
  – F-test: the general (overall) test of the model
  – t-test for zero intercept:
      • H0: β0 = 0
      • H1: β0 ≠ 0
  – t-test for zero slope (in simple linear regression its result is the same as that of the F-test):
      • H0: β1 = 0 – there is no relationship between X and Y.
      • H1: β1 ≠ 0 – there is a relationship between X and Y.
• Coefficient of determination (R²):
  – Gives the proportion of the variation in Y that is accounted for by X.
• Residuals of the model (error):
  – The variation in the data left over after the linear regression model has been accounted for.
• Model validation process:
  – After fitting a model to our data, we need to check whether the assumptions of linear regression analysis are met.
  – This can be done by examining the residuals of the fitted model (see the residual-check sketch at the end).
• Assumptions of the linear regression model:
  – Independence: the observations are independent of one another.
  – Normality: the populations of Y-values and the error terms (εi) are normally distributed for each value of the predictor variable xi.
  – Homogeneity of variance: the populations of Y-values and the error terms (εi) have the same variance for each xi.
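A minimal sketch of fitting and testing a simple linear regression with scipy.stats.linregress (an assumed choice; a package such as statsmodels would also report the overall F-test and the intercept t-test). The x and y values are hypothetical:

    # Simple linear regression Y = beta0 + beta1*X (SciPy sketch).
    from scipy.stats import linregress

    # Hypothetical predictor (x) and response (y) values.
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [3.6, 5.1, 6.4, 8.2, 9.4, 11.1, 12.3, 14.0]

    fit = linregress(x, y)

    print(f"intercept (beta0 estimate) = {fit.intercept:.3f}")
    print(f"slope     (beta1 estimate) = {fit.slope:.3f}")
    print(f"R-squared = {fit.rvalue**2:.3f}")  # share of variation in Y explained by X
    print(f"p-value of the t-test for zero slope = {fit.pvalue:.4f}")
    # With a single predictor this t-test is equivalent to the overall F-test;
    # a small p-value means rejecting H0: beta1 = 0, i.e. X and Y are related.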
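Finally, a short sketch of the residual-based model validation mentioned above: the residuals of the fitted line are checked for normality with a Shapiro–Wilk test and, informally, for homogeneity of variance by inspecting them against the fitted values (the data are the same hypothetical values as in the previous sketch):

    # Residual checks for the fitted simple linear regression (sketch).
    import numpy as np
    from scipy.stats import linregress, shapiro

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([3.6, 5.1, 6.4, 8.2, 9.4, 11.1, 12.3, 14.0])

    fit = linregress(x, y)
    fitted = fit.intercept + fit.slope * x
    residuals = y - fitted                      # variation left over after the model

    # Normality of the error terms: Shapiro-Wilk test on the residuals.
    w_stat, p_norm = shapiro(residuals)
    print(f"Shapiro-Wilk on residuals: W = {w_stat:.3f}, p = {p_norm:.4f}")

    # Homogeneity of variance: residuals against fitted values should show no
    # trend or funnel shape (printed here; normally inspected in a plot).
    for f_val, r_val in zip(fitted, residuals):
        print(f"fitted = {f_val:6.2f}   residual = {r_val:+.3f}")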