Lab #12 TESTS FOR RELATIONSHIPS: MULTIPLE REGRESSION

A natural extension of bivariate regression is the step to multiple regression. In this application, the word "multiple" means more than one independent variable, or multi-factor. As you are well aware by now, most effects in the social world have more than one cause, and most of the time those multiple causes are not additive; instead, they interact with one another to produce an overall outcome. For instance, a country may have a very high per capita gross domestic product, possibly due to a recently discovered natural resource (such as oil), but if the dietary habits of its citizens are very poor, then the high per capita GDP will not have a strong effect in reducing infant mortality rates.

SPSS can easily handle additional independent variables in the OLS regression test, and the reporting is only slightly adjusted to accommodate the additional variables. As always, state your independent variables first. Let's expand a research question and read the output for a model that includes multiple independent variables:

"Is there a relationship between a country’s population, per capita gross domestic product, and daily calories per person, with the infant mortality rate?"

The independent variables are listed in series; the dependent variable is stated last.

Go to the dialog box for linear regression as you did before, but add all of the independent variables into the box.

One part of the output looks like this:

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .793(a)   .630       .614                24.0680

a. Predictors: (Constant), Gross domestic product / capita, Population in thousands, Daily calorie intake

R square: R2 will sound much the same as it does in a bivariate regression, except that you will be reporting on the Adjusted R2 in a multiple regression model.
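As a rough illustration of how the Adjusted R2 relates to R2, the standard adjustment can be sketched in a few lines of Python. The function name is hypothetical, and the handout does not report how many countries are in the data file, so the sample size used below (74) is an assumed value for illustration only. The fact that it lands close to the table's .614 does not establish the actual number of cases.

```python
def adjusted_r_squared(r_squared: float, n_cases: int, n_predictors: int) -> float:
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_cases - n_predictors - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


# R Square (.630) comes from the Model Summary table; the model has 3 predictors.
# n_cases=74 is an ASSUMED sample size, not a value reported in the handout.
adj = adjusted_r_squared(0.630, n_cases=74, n_predictors=3)
print(round(adj, 3))
```

The adjustment always pulls R2 downward, and the penalty grows as predictors are added relative to the number of cases.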
The adjusted R2 considers the additional complexities of multiple independent variables in the research model, so we'll refer to all of them in our report:

"Population, per capita GDP, and daily calorie intake per person together explain about 61% of the variance in infant mortality rates of the world's nations."

In a multiple regression you are stating how much of the variance in the dependent variable is explained by all of the independent variables in the model.

Coefficients(a)

                                  Unstandardized           Standardized
                                  Coefficients             Coefficients
Model 1                           B          Std. Error    Beta           t        Sig.
(Constant)                        166.171    18.323                       9.069    .000
Population in thousands           2.669E-6   .000          .012           .166     .869
Gross domestic product / capita   -.001      .001          -.242          -2.208   .030
Daily calorie intake              -.041      .007          -.594          -5.429   .000

a. Dependent Variable: Infant mortality (deaths per 1000 live births)

SLOPE: We report on the slope for each independent variable separately, but we have to add an important qualifier each time we state it. In this example we will also have to unravel the scientific notation in the regression coefficient for population. Simply move the decimal point the number of places noted after the ‘E’ in the regression coefficient: to the right if the sign is “+”, to the left if the sign is “-” (in the case of Population in thousands: -6, so six places to the left).

2.669E-6 = .000002669

Also, this is where it is necessary to know your units of measurement. In this case, population is measured in units of one thousand people.

The multiplication rule: It would be nearly impossible to state the slope for population the way it is currently presented: "For each additional thousand people in a nation, it is predicted there would be a .0000027 increase in infant deaths per 1000 births." That is just unmanageable. We can solve this problem by multiplying both variables by factors of 10 until the sentence has a more useful interpretation. The important rule to follow is: By whatever amount one variable is multiplied, the other variable must also be multiplied by that same amount.
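The scientific-notation conversion and the multiplication rule can be checked with a short Python sketch. This is only an illustration outside SPSS. The helper name `rescale` is hypothetical, and `Decimal` is used so the printed digits match the coefficient exactly as SPSS displays it.

```python
from decimal import Decimal

# Slope for "Population in thousands", exactly as printed in the Coefficients table.
slope_per_thousand = Decimal("2.669E-6")

# Unravel the scientific notation: E-6 moves the decimal point six places left.
print(format(slope_per_thousand, "f"))


def rescale(slope: Decimal, multiplier: int) -> Decimal:
    """Multiplication rule: the slope is multiplied by the same factor as the I.V."""
    return (slope * multiplier).normalize()


base_people = 1_000  # one unit of the variable equals 1,000 people
for multiplier in (10, 100, 1_000, 10_000, 100_000):
    people = base_people * multiplier
    deaths = rescale(slope_per_thousand, multiplier)
    print(f"x{multiplier:,}: {people:,} people -> {format(deaths, 'f')} infant deaths per 1000 births")
```

The values printed here are unrounded; rounding them to two significant digits gives the figures used in the handout's sentences.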
Another, more general rule is to examine the range of values for the independent variable and make a judgment about reasonable increases. So, we start multiplying until we have numbers that make more sense:

Multiplier   I.V. (people)   D.V. (infant deaths per 1000 births)
x10          10,000          .000027
x100         100,000         .00027
x1000        1,000,000       .0027
x10,000      10,000,000      .027
x100,000     100,000,000     .27

If we were to look at the X axis of a histogram, we could see that increments of 10 million or 100 million for the population variable make sense. So, it would sound like this:

"For each additional 10 million people in the population of a nation, there is predicted to be a .027 increase in infant deaths per 1000 births (p= .869)";

or,

"For each additional 100 million people in the population of a nation, there is predicted to be a .27 increase in infant deaths per 1000 births (p= .869)"

… but there’s more. There is an important addition we must include in our statement for a multiple regression model. It sounds like this:

"For each additional 10 million people in the population of a nation, there is predicted to be a .027 increase in infant deaths per 1000 births, holding constant for per capita GDP and daily calories per person (p= .869)."

The qualifier, "holding constant for...", is the verbal way of accounting for the addition of extra independent variables in the regression model.

Each independent variable is handled separately in the same fashion. In the next case, per capita gross domestic product is measured in dollars. If we read the output as it is, we would state, "For each additional dollar in a country's per capita gross domestic product it is predicted there would be a .001 decrease in infant deaths per 1000 births." Even though the relationship is statistically significant, .001 deaths per 1000 births appears almost imperceptible. This is because the independent variable, per capita gross domestic product, is being measured in $1 increments, which are also imperceptibly small.
In this case (not otherwise shown in the output), per capita gross domestic product has a range of $122 (Ethiopia) to $23,474 (United States) with a mean of $5860. Based on this range, it is reasonable to use $1000 increments to report the slope. So let's restate our slope value in this more reasonable way:

"For each additional $1000 in a country's per capita gross domestic product it is predicted there would be a decrease of 1.0 infant deaths per 1000 births, holding constant for population and daily calorie intake (p= .03)."

We arrived at the number "1.0" by multiplying the regression coefficient by 1000 (1000 x .001 = 1.0), the same amount we used to enhance the independent variable. This satisfies another goal of demographers and others who use numbers like this: Use the multiplication rule to move the decimal point until you get a whole number in one or both of the reported variables. It isn’t always possible to get whole numbers, but it is a general goal to approach.

And finally:

"For each additional 100 calories in the daily diets of the citizens in a nation, the number of infant deaths is predicted to decrease by 4.1 per 1000 births, holding constant for per capita GDP and population (p< .001)."

We multiplied both calories and the slope for infant mortality by 100 (100 x .041 = 4.1).

Another value of multiple regression models is that we can determine which independent variables are strongest and which are weakest in predicting the outcome. We do this by referring to the Probability (Sig.) column. In this case we can judge that population has the weakest predictive value; in fact, its regression coefficient is not even statistically significant (p= .869). We can further determine that daily calories per person has the strongest predictive value because its p value is less than .001 (remember that there is no probability that equals zero, so a printed Sig. of .000 is reported as p< .001).
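The three rescaled slopes and the ordering by the Sig. column can be reproduced with a small Python sketch. The B and Sig. values are copied from the Coefficients table, and the reporting increments are the ones chosen in the text. The variable and function names are hypothetical, and nothing here comes from SPSS itself.

```python
# Each row: (label, B, Sig., multiplier in the variable's own units, increment wording).
# Population is recorded in thousands, so 10 million people = 10,000 units.
predictors = [
    ("population", 2.669e-6, 0.869, 10_000, "10 million people"),
    ("per capita GDP", -0.001, 0.030, 1_000, "$1000"),
    ("daily calories", -0.041, 0.000, 100, "100 calories"),  # Sig. prints as .000, i.e. p < .001
]


def rescaled_slope(b: float, multiplier: int) -> float:
    """Apply the multiplication rule and round for reporting."""
    return round(b * multiplier, 3)


for label, b, sig, multiplier, increment in predictors:
    # Every other predictor in the model goes into the "holding constant for" qualifier.
    others = [name for name, *_ in predictors if name != label]
    p_text = "p < .001" if sig < 0.001 else f"p = {sig:.3f}"
    print(
        f"Per additional {increment}: {rescaled_slope(b, multiplier):+} infant deaths "
        f"per 1000 births, holding constant for {' and '.join(others)} ({p_text})"
    )

# Order the predictors from smallest to largest Sig., as the handout does.
ranked = [label for label, _, sig, *_ in sorted(predictors, key=lambda row: row[2])]
print("Strongest to weakest by Sig.:", ranked)
```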
So the strength of an independent variable in a regression model can be compared to that of any other independent variable by comparing the significance of their regression coefficients.

Twelfth (and last) Lab Assignment (worth 5 points)

Go to “PASW 17.0 for Windows” and open any data file that you find interesting and that has the appropriate variables to complete this assignment. Produce a table of “Descriptives” so that you can study the data set for continuous variables. Be certain that you are clear on the units of analysis and on your units of measurement. Use only continuous variables.

1. Produce 1 multiple regression using at least 3 independent variables.
2. Include the ‘Descriptives’ table; the ‘Model Summary’ table; and the ‘Coefficients’ table.
3. Report on the R2 for the whole model.
4. Report separately for each slope.

This assignment is due on Tuesday, March 12th at 3 pm (on Turnitin.com). There is a penalty for late assignments. No labs will be accepted after March 19th.

YOU MUST DO YOUR OWN WORK ON ALL LAB ASSIGNMENTS. You are welcome to ask for assistance on assignments, and you may discuss the course material with other students; however, when you begin to follow the guidelines for the assignment you must work alone and hand in a work product that you accomplished by yourself. Handing in another person’s work will constitute cheating and could result in expulsion from the course.