Learning Objectives:
Be able to get means, SDs, frequencies, and histograms for any variable.
Be able to determine the Cronbach's alpha for any set of items.
Be able to reverse score a single item and combine items into a single scale (a code sketch follows below):
1. Check for errors (range, distribution)
2. Reverse code any relevant items
3. Estimate reliability (Cronbach's alpha)
4. Combine items
Be able to correlate any two variables (represented as a single item or as a scale).
Be able to "center" a variable before entering it as a predictor in a multiple regression equation.
Be able to distinguish between mediators and moderators.
Be able to draw simple path diagrams.
Be able to create "product" terms to test interaction questions.
Be able to enter predictors in multiple steps to test mediation questions.

Two favorite resources:
http://www.quantpsy.org/medn.htm (Kristopher Preacher's website)
http://www.afhayes.com/ (Andrew Hayes' website)

Words of wisdom (adapted from Allison, P., 1999):
The most popular kind of regression is ordinary least squares, but there are other methods.
Ordinary multiple regression is called linear because it can be represented graphically by a straight line.
A linear relationship between two variables is usually described by two numbers, slope and intercept.
We assume relationships are linear because it is the simplest kind of relationship and there is usually no good reason to consider something more complicated (principle of parsimony).
You need more cases than variables, ideally a lot more cases.
Ordinal variables are not well represented by multiple regression.
Ordinary least squares chooses the regression coefficients (slopes and intercept) to minimize the sum of the squared prediction errors.
The R2 is the statistic most often used to measure how well the outcome/criterion variable can be predicted from knowledge of the predictor variables.
To evaluate the least squares estimates of the regression coefficients, we usually rely on confidence intervals and hypothesis tests.
Multiple regression allows us to statistically control for measured variables, but this control will never be as good as a randomized experiment.
To interpret the numerical value of a regression coefficient, it is essential to understand the metrics of the outcome/criterion and predictor variables.
Coefficients for dummy (0, 1) variables usually can be interpreted as differences in means on the outcome variable for the two categories of the predictor variable, controlling for other variables in the regression model.
Standardized coefficients can be compared across predictor variables with different units of measurement. They tell how many standard deviations the outcome variable changes for an increase of one standard deviation in a predictor variable.
The intercept (or constant) in a regression model rarely tells you anything interesting.
Don't exaggerate the importance of R2 in evaluating a regression model. A model can still be worthwhile even if R2 is low.
In a multiple regression, there is no distinction among different kinds of predictor variables, nor do the results depend upon the order in which the variables appear in the model.
Varying the set of variables in the regression model can be helpful in understanding the causal relationships.
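The four scale-construction steps listed in the objectives above can be illustrated with a short script. This is only a minimal sketch: the DataFrame df, the items item1 through item4 (with item3 negatively worded), the 1-5 response scale, and the outcome column are all made-up examples, not a prescribed setup.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four 1-5 Likert items (item3 negatively worded) and an outcome.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=n)

def likert(signal):  # clip a noisy signal onto the 1-5 response range
    return np.clip(np.round(3 + signal + rng.normal(scale=0.8, size=n)), 1, 5)

df = pd.DataFrame({
    "item1": likert(latent),
    "item2": likert(latent),
    "item3": likert(-latent),  # reverse-worded item
    "item4": likert(latent),
    "outcome": 2 + 0.5 * latent + rng.normal(size=n),
})

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

items = ["item1", "item2", "item3", "item4"]

# 1. Check for errors: every value should fall within the 1-5 response range.
assert df[items].min().min() >= 1 and df[items].max().max() <= 5

# 2. Reverse code the negatively worded item (on a 1-5 scale: 6 - score).
df["item3_r"] = 6 - df["item3"]

# 3. Estimate reliability (Cronbach's alpha) on the reverse-scored item set.
scale_items = ["item1", "item2", "item3_r", "item4"]
print("Cronbach's alpha:", round(cronbach_alpha(df[scale_items]), 3))

# 4. Combine items into a single scale (mean of the items), then correlate it.
df["scale"] = df[scale_items].mean(axis=1)
print("r(scale, outcome):", round(df["scale"].corr(df["outcome"]), 3))
```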
If the coefficient for a variable x goes down when other variables are entered, it means that either a) the other variables mediate the effect of x on the outcome variable y, or b) the other variables affect both x and y, and therefore the original coefficient of x was partly spurious.
If a dummy predictor variable has nearly all the observations in one of its two categories, even a large effect of this variable may not show up as statistically significant.
If the global F test for all the variables in a multiple regression model is not statistically significant, you should be very cautious about drawing any additional conclusions about the variables and their effects.
When multiple dummy variables are used to represent a categorical variable with more than two categories, it is crucial in interpreting the results to know which is the omitted category.
Leaving important variables out of a regression model can bias the coefficients of other variables and lead to spurious conclusions.
Non-experimental data tells you nothing about the direction of a causal relationship. You must decide the direction based on your prior knowledge of the phenomenon you are studying. Time ordering usually gives us the most important clues about the direction of causality.
Measurement error in predictor variables leads to bias in the coefficients. Variables with more measurement error tend to have coefficients that are biased toward 0. Variables with little or no measurement error tend to have coefficients that are biased away from 0.
The degree of measurement error in a variable is usually quantified by an estimate of its reliability, a number between 0 and 1. A Cronbach's alpha of 1 indicates a perfect measure; a Cronbach's alpha of 0 indicates that the variation in the variable is pure error. Most psychologists prefer alphas to be above .70; anything below .60 is difficult to defend.
With small samples, even large regression coefficients may not be statistically significant. In such cases, you are not justified in concluding that the variable has no effect – the sample may not have been large enough to detect it. In small samples, the approximations used to calculate p values may not be very accurate, so be cautious in interpreting them.
In a large sample, even trivial effects may be statistically significant. You need to look carefully at the magnitude of each coefficient to determine whether it is large enough to be substantively interesting.
When the measurement scale of a variable is unfamiliar, standardized coefficients can be helpful in evaluating the substantive significance of a regression coefficient.
If you are interested in the effect of x on y, but the regression model also includes intervening (mediating) variables w and z, the coefficient for x may be misleadingly small: you have estimated the direct effect of x on y, but you have missed the indirect effects through w and z. If the intervening variables w and z are deleted from the regression model, the coefficient for x represents its total effect on y. The total effect is the sum of the direct and indirect effects. (See the sketch below.)
If two or more predictor variables are highly correlated, it is difficult to get good estimates of the effect of each variable controlling for the others. This problem is known as multicollinearity. When two predictor variables are highly collinear, it is easy to incorrectly conclude that neither has an effect on the outcome variable.
It is important to consider whether the sample is representative of the intended population.
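The total/direct/indirect distinction can be seen by entering predictors in steps, as in the learning objectives. The sketch below uses simulated variables x, w (a single mediator), and y; with real data you would substitute your own columns, and inference on the indirect effect would normally use a dedicated tool such as those on the Preacher and Hayes sites listed above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data in which w partially mediates the effect of x on y.
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
w = 0.6 * x + rng.normal(size=n)            # mediator caused by x
y = 0.4 * x + 0.5 * w + rng.normal(size=n)  # y caused by both x and w
df = pd.DataFrame({"x": x, "w": w, "y": y})

# Step 1: regress y on x alone; the coefficient for x is its total effect.
step1 = smf.ols("y ~ x", data=df).fit()

# Step 2: add the mediator; the coefficient for x is now its direct effect.
step2 = smf.ols("y ~ x + w", data=df).fit()

total = step1.params["x"]
direct = step2.params["x"]
print(f"total effect  : {total:.3f}")
print(f"direct effect : {direct:.3f}")
print(f"indirect effect (total - direct): {total - direct:.3f}")
```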
Check data to make sure all values fall within the range of the variable's response categories.
The default method for handling missing data is listwise deletion – deleting any case that has missing data on any variable.
Studentized residuals are useful for finding observations with large discrepancies between the observed and predicted values. Influence statistics tell you how much the regression results would change if a particular observation were deleted from the sample.
The standard error of the regression slope is used to calculate confidence intervals. A large standard error means an unreliable estimate of the coefficient. The standard error goes up with the variance in y. The standard error goes down with the sample size, the variance in x, and the R2.
The ratio of the slope coefficient to its standard error is a t statistic, which can be used to test the null hypothesis that the coefficient is 0. With samples larger than 100, a t statistic greater than 2 (or less than -2) means that the coefficient is statistically significant (.05 level, two-tailed test).
The regression of y on x produces a different regression line from the regression of x on y – the two lines cross at the intersection of the two means. The two slopes will always have the same sign. The two lines coincide when x and y are perfectly correlated.
We can get a confidence interval around a regression coefficient by adding and subtracting twice its standard error.
Efficient estimation methods have standard errors that are as small as possible. That means that, in repeated sampling, they do not fluctuate much around the true value.
If we have a probability sample drawn so that every individual in the population has the same chance of being chosen, then the least squares regression in the sample is an unbiased estimate of the least squares regression in the population.
The standard linear model has five assumptions about how the values of the outcome variable are generated from the predictor variables. The assumptions of linearity and mean independence imply that least squares is unbiased. The additional assumptions of homoscedasticity and uncorrelated errors imply that least squares is efficient. The normality assumption implies that a t table gives valid p values for hypothesis tests.
The disturbance U represents all unmeasured causes of the outcome variable y. It is assumed to be a random variable, having an associated probability distribution. Mean independence means that the mean of the random disturbance U does not depend on the values of the x variables. Mean independence of U is the most critical assumption because violations can produce severe bias.
Homoscedasticity means the degree of random noise in the relationship between y and x is always the same. You can check for violations of the homoscedasticity assumption by plotting the residuals against the predicted values of y. There should be a uniform degree of scatter, regardless of the predicted values.
If observations are clustered or can interact, correlated errors are more likely. Correlated errors lead to underestimates of standard errors, which inflate test statistics (so you should use generalized least squares). The normality assumption is the least critical, particularly with a large sample.
Tolerance is computed by regressing each independent variable on all the other independent variables and then subtracting the R2 from that regression from 1. A low tolerance indicates serious multicollinearity. Multicollinearity means that it is hard to get reliable coefficients, and it can suppress effects. (See the diagnostics sketch below.)
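As one possible way to run these checks, the sketch below uses statsmodels to get studentized residuals, one influence statistic (Cook's distance), and the tolerance of each predictor. The predictors x1 and x2 and the outcome y are simulated stand-ins for your own data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data with two moderately correlated predictors.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)
y = 1 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
results = sm.OLS(y, X).fit()

# Studentized residuals flag observations with large discrepancies between
# observed and predicted values; Cook's distance is one influence statistic
# summarizing how much the results would change if an observation were deleted.
influence = results.get_influence()
student_resid = influence.resid_studentized_external
cooks_d = influence.cooks_distance[0]
print(f"largest |studentized residual|: {np.abs(student_resid).max():.2f}")
print(f"largest Cook's distance: {cooks_d.max():.3f}")

# Homoscedasticity check (not shown): plot results.resid against
# results.fittedvalues and look for a uniform degree of scatter.

# Tolerance = 1 - R2 from regressing each predictor on the other predictors,
# which equals 1 / VIF; a low tolerance indicates serious multicollinearity.
for i, name in enumerate(X.columns):
    if name != "const":
        print(f"tolerance({name}) = {1.0 / variance_inflation_factor(X.values, i):.3f}")
```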
To address multicollinearity, delete one or more variables, combine variables into an index, estimate a latent variable model, or perform joint hypothesis tests.
The most common way to represent interaction in a regression model is to add a new variable that is the product of two variables already in the model. Such models implicitly say that the slope for each of the two variables is a linear function of the other variable. In models with a product term, the main effect coefficients represent the effect of that variable when the other variable has a value of 0.
The best way to interpret a regression model with a product term is to calculate the effects of each of the two variables for a range of different values of the other variable, as sketched below. When one of the variables is a dummy variable, the model can be interpreted by splitting up the regression into two separate regressions, one for each of the two values of the dummy variable.
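Here is a minimal sketch of the centering, product-term, and "effects at different values of the other variable" ideas, using simulated x, z, and y. Probing the slope at the mean and at plus or minus one SD of the moderator is a common convention, not the only choice.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data in which the effect of x on y depends on z.
rng = np.random.default_rng(3)
n = 300
x = rng.normal(loc=5, scale=2, size=n)
z = rng.normal(loc=10, scale=3, size=n)
y = 1 + 0.3 * x + 0.2 * z + 0.15 * x * z + rng.normal(size=n)
df = pd.DataFrame({"x": x, "z": z, "y": y})

# Center the predictors so the "main effect" coefficients describe the effect
# of each variable when the other is at its mean (i.e., at 0 after centering).
df["x_c"] = df["x"] - df["x"].mean()
df["z_c"] = df["z"] - df["z"].mean()
df["xz_c"] = df["x_c"] * df["z_c"]  # product term carrying the interaction

model = smf.ols("y ~ x_c + z_c + xz_c", data=df).fit()
print(model.params)

# Interpret by computing the slope of x at several values of the moderator z
# (here the mean and one SD below/above it).
b_x, b_xz = model.params["x_c"], model.params["xz_c"]
for label, z0 in [("-1 SD", -df["z_c"].std()), ("mean", 0.0), ("+1 SD", df["z_c"].std())]:
    print(f"slope of x when z is at {label}: {b_x + b_xz * z0:.3f}")
```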