Multiple Regression

Multiple regression is a statistical technique used to examine the relationship
among three or more continuous variables when one of the variables can be
thought of as predicted by the others. Simple linear regression has only one
independent (predictor) variable; multiple regression has two or more. The
following table comes from a somewhat larger Excel file,
LifeExpectancywithallnotes.xlsx, downloaded from the World Health Organization
(http://www.who.int/whosis/en/) for a select list of countries and a select list
of indicators, using the most recent data available. It contains life expectancy
at birth along with five factors that may have an influence on how long people
live. From these data we might develop a model to predict length of life.
Location        Life         Adult           Hospital beds   Number of    Incidence of        Newborns with
                expectancy   literacy rate   (per 10,000     physicians   tuberculosis (per   low birth
                at birth     (%), both       population)                  100,000 population  weight (%)
                (years)      sexes                                        per year)
-----------------------------------------------------------------------------------------------------------
Australia       82           -               40              47875        6                   7
Bangladesh      63           47.5            3               42881        225                 30
Botswana        52           81.2            24              715          551                 10
Brazil          72           88.6            26              198153       50                  10
Canada          81           -               34              62307        5                   6
Chile           78           95.7            23              17250        15                  5
China           73           90.9            22              1862630      99                  6
Cuba            78           99.8            49              66567        9                   6
Czech Republic  77           -               84              36595        10                  7
Denmark         79           -               38              19287        8                   5
Egypt           68           71.4            22              179900       24                  12
France          81           -               73              207277       14                  7
Germany         80           -               83              284427       6                   7
Ghana           57           57.9            9               3240         203                 11
Greece          80           96              47              55556        18                  8
India           63           61              -               645825       168                 30
Indonesia       68           90.4            -               29499        234                 9
Israel          81           -               60              25138        8                   8
Italy           81           98.4            40              215000       7                   6
Japan           83           -               141             270371       22                  8
Jordan          71           91.1            19              13460        5                   10
Kuwait          78           93.3            19              4840         24                  7
Mexico          74           91.6            10              195897       21                  9
(- indicates a value that is not reported.)
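The model would follow the format
y = b0 + b1(x1) + b2(x2) + ... + bk(xk)
where b0 is the intercept and k is the number of independent variables.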
In the life expectancy example, y would represent life expectancy and there would
be five Xs (adult literacy rate (%), hospital beds per 10,000 population, number of
physicians, tuberculosis rate, and newborns (%) with low birth weight). Each b
term represents the slope of y on the corresponding x while holding each of the
other x variables in the model constant. In simple linear regression, b represents
the change in y per unit change in x. In multiple regression, however, b
represents the change in y per unit change in the corresponding x after taking
into account the effects of the other x variables in the model. Performing the
multiple regression with the 42 countries in the Excel file results in the following
prediction equation, where y represents the predicted life expectancy, x1
represents the adult literacy rate, x2 represents the number of hospital beds per
10,000 population, x3 represents the number of physicians, x4 represents the
tuberculosis rate, and x5 represents the percentage of newborns with low birth
weight:
y = 47.26 + 0.33(x1) - 0.076(x2) - 0.0000003(x3) - 0.029(x4) - 0.059(x5)
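To make the substitution step concrete, here is a minimal Python sketch that plugs one country's values into the prediction equation. The coefficients are copied from the equation above and the Brazil values from the table; the function and variable names are our own illustrative choices, not part of the course files.

```python
# Coefficients copied from the prediction equation in the text.
INTERCEPT = 47.26
COEFFICIENTS = {
    "adult_literacy_rate": 0.33,          # x1
    "hospital_beds_per_10000": -0.076,    # x2
    "physicians": -0.0000003,             # x3
    "tuberculosis_rate": -0.029,          # x4
    "low_birth_weight_pct": -0.059,       # x5
}


def predict_life_expectancy(values):
    """Return the predicted life expectancy for a dict holding the five x values."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * values[name] for name in COEFFICIENTS
    )


# Brazil's row from the table (names are hypothetical labels for the columns).
brazil = {
    "adult_literacy_rate": 88.6,
    "hospital_beds_per_10000": 26,
    "physicians": 198153,
    "tuberculosis_rate": 50,
    "low_birth_weight_pct": 10,
}

predicted = predict_life_expectancy(brazil)
print(round(predicted, 1))  # about 72.4 years
```

The predicted value of roughly 72.4 years is close to the 72 years reported for Brazil in the table.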
After developing the multiple regression equation, values for the Xs can be
substituted and the predicted y can be computed. To see how multiple
regression is performed in Excel, please watch the MultipleRegression video clip.
The file used for the video example is Baseball2009.xlsx, and the annotated
output is shown in the video clip.
As in simple linear regression, r square is the proportion of the variation in y
that is explained by the Xs. In our baseball example, the five independent
variables explain slightly more than 87% of the variation in wins. The remaining
13% of the variation in wins is attributable to factors not included in the model.
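For readers who want to see where r square comes from, the following is a rough sketch in plain Python that fits a regression by least squares and computes r square as 1 minus (unexplained variation / total variation). The baseball file is not reproduced here, so as an assumption the sketch uses the ten countries in the life expectancy table that report every value, with only two predictors (adult literacy and tuberculosis incidence). The function names are our own, and the results will not match the 42-country equation or the Excel output.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(xs, y):
    """Return [b0, b1, ..., bk] from the normal equations (X'X)b = X'y."""
    rows = [[1.0] + list(r) for r in xs]  # leading 1.0 is the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(xs, y, b):
    """Proportion of the variation in y explained by the fitted equation."""
    pred = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)) for r in xs]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# (adult literacy %, TB incidence) for Bangladesh, Botswana, Brazil, Chile,
# China, Cuba, Ghana, Greece, Kuwait, Mexico; y is life expectancy.
xs = [(47.5, 225), (81.2, 551), (88.6, 50), (95.7, 15), (90.9, 99),
      (99.8, 9), (57.9, 203), (96, 18), (93.3, 24), (91.6, 21)]
y = [63, 52, 72, 78, 73, 78, 57, 80, 78, 74]

b = fit_ols(xs, y)
r2 = r_squared(xs, y, b)
print("coefficients:", [round(v, 4) for v in b])
print("r square:", round(r2, 3))
```

With only ten observations this is far below the sample sizes recommended later in this handout, so the output should be read as a demonstration of the calculation rather than a usable model.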
If the model is statistically significant, the significance (p-values) of the individual
predictors can be examined. In the baseball example above, because the model is
significant, we can proceed to examine the significance of the five Xs. Four of
them are statistically significant, with Opponents Batting Average being the only
predictor that is not. If the overall regression model is not statistically
significant, it is not appropriate to examine the statistical significance of the
individual independent variables.
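The overall significance of a regression model is usually judged with an F test (Excel reports its p-value as "Significance F"). As a rough sketch, the F statistic can be computed from r square, the number of observations n, and the number of predictors k. The r square of 0.87 and the five predictors come from the baseball example; the sample size of 30 is purely an assumption for illustration, since the handout does not state how many observations the file contains.

```python
def overall_f_statistic(r_square, n_obs, n_predictors):
    """F statistic for the overall regression, computed from r square.

    F = (r_square / k) / ((1 - r_square) / (n - k - 1))
    """
    df_model = n_predictors
    df_error = n_obs - n_predictors - 1
    if df_error <= 0:
        raise ValueError("need more observations than predictors plus one")
    if not 0 <= r_square < 1:
        raise ValueError("r_square must be at least 0 and less than 1")
    return (r_square / df_model) / ((1 - r_square) / df_error)


# r square and predictor count from the text; n_obs=30 is an ASSUMED value.
f_value = overall_f_statistic(0.87, n_obs=30, n_predictors=5)
print(round(f_value, 2))  # about 32.12 under these assumptions
```

The resulting F value is then compared with a critical value from an F table (here with 5 and 24 degrees of freedom), or converted to a p-value by the software. Only if that overall test is significant would we go on to the individual predictors.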
Other Multiple Regression Considerations
A general rule of thumb is that a multiple regression model should
ideally have a minimum of 25 observations per independent
variable. So, we would need at least 125 observations (5 x 25) for our
baseball problem. Regression coefficients are very unstable with fewer
observations, thus our confidence in the above model would be limited.
Another consideration, like with the other statistics we’ve covered in this
course, is that simply because a relationship is statistically significant, it
doesn’t mean that it’s important or useful. Although the r square provides
evidence regarding the usefulness of the regression equation, in the end
the importance or usefulness of a particular result is a management
decision!
A common problem with multiple regression is when two or more of the
independent variables are highly inter-correlated. This problem is known
as multicollinearity and is particularly problematic when trying to determine
the statistical significance of the individual predictors. When simply
developing a prediction model or determining how much variance in the
dependent variable is explained by a group of independent variables,
multicollinearity is not a problem. Although dealing with
multicollinearity is beyond the scope of this course, some ways this
problem can be dealt with include 1) combining the inter-correlated
variables, 2) collecting more observations, 3) eliminating one of the
offending variables from the model, or 4) standardizing the independent
variables (most helpful when the correlation comes from interaction or
polynomial terms).
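One simple way to spot highly inter-correlated predictors is to compute the correlation between every pair of independent variables. The sketch below does this in plain Python for four of the predictors, using the ten countries in the life expectancy table that report every value. The 0.8 cut-off and all names are assumptions chosen for illustration, not rules from this course.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Values for Bangladesh, Botswana, Brazil, Chile, China, Cuba, Ghana,
# Greece, Kuwait, Mexico (the rows of the table with no missing values).
predictors = {
    "adult_literacy": [47.5, 81.2, 88.6, 95.7, 90.9, 99.8, 57.9, 96, 93.3, 91.6],
    "hospital_beds": [3, 24, 26, 23, 22, 49, 9, 47, 19, 10],
    "tb_incidence": [225, 551, 50, 15, 99, 9, 203, 18, 24, 21],
    "low_birth_weight": [30, 10, 10, 5, 6, 6, 11, 8, 7, 9],
}

THRESHOLD = 0.8  # hypothetical cut-off for "highly correlated"

names = list(predictors)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(predictors[first], predictors[second])
        print(f"{first} vs {second}: r = {r:.2f}")
        if abs(r) >= THRESHOLD:
            flagged.append((first, second, r))

print("pairs at or above the cut-off:", flagged)
```

Pairwise correlations only catch collinearity between two variables at a time. A predictor can also be nearly a combination of several others, which this check would miss.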
Data is sometimes not very clean. In other words, there are often
errors or missing values in data sets. In many studies observations are
not obtainable or people simply do not answer every item in a
questionnaire. In software specifically designed for statistical analysis,
such as SAS (http://www.sas.com/ ) or SPSS (http://www.spss.com/),
missing values are automatically excluded from the analysis. In the
regression routines in Excel, however, this is not the case. Thus in
performing regression (linear and multiple) using Excel, remember that if
the routine does not work, a common problem is that there may be
missing data. The remedy for this problem is to manually delete the
observation(s) that have a missing response for that particular analysis.
Note: In large data sets with considerable missing data, sorting the data,
deleting the rows with missing data, and then repeating this process
simplifies the data clean-up.
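Outside Excel, the same clean-up can be scripted. The following is a minimal Python sketch, under our own assumptions, of dropping observations that have a missing value in any variable used for a particular analysis. The three rows are illustrative: the Bangladesh and Brazil values come from the table, while treating Australia's adult literacy as the missing entry is an assumption. The function and field names are hypothetical.

```python
def drop_incomplete(rows, fields):
    """Keep only the rows that have a value for every field in `fields`."""
    return [r for r in rows if all(r.get(f) is not None for f in fields)]


# None marks a missing value. Australia's missing literacy is ASSUMED here.
rows = [
    {"location": "Bangladesh", "life": 63, "literacy": 47.5, "tb": 225},
    {"location": "Australia", "life": 82, "literacy": None, "tb": 6},
    {"location": "Brazil", "life": 72, "literacy": 88.6, "tb": 50},
]

# Analysis that uses literacy: Australia must be excluded.
with_literacy = drop_incomplete(rows, ["life", "literacy", "tb"])
print([r["location"] for r in with_literacy])

# Analysis that does not use literacy: all three rows can stay.
without_literacy = drop_incomplete(rows, ["life", "tb"])
print([r["location"] for r in without_literacy])
```

Passing the list of fields makes the deletion specific to one analysis, so an observation is only lost when the variable it is missing is actually in the model.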
There are numerous regression techniques that have been developed for
specific purposes. Several of these are designed to determine the
predictors from a larger set that maximize the variance explained in the
dependent variable. Examples of these specialized techniques include
stepwise regression (both forward and backward) and best-subsets
regression.
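To give a feel for how forward selection works, here is a simplified Python sketch. It starts with no predictors and, at each step, adds the one that raises r square the most, stopping when the gain falls below a small threshold. Statistical packages normally use F-to-enter or p-value criteria instead. The r square gain rule, the 0.01 threshold, and all names here are assumptions made to keep the example short. The data are the ten complete rows of the life expectancy table.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """r square of a least-squares fit of y on the given columns plus intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    b = solve(xtx, xty)
    pred = [sum(bj * xj for bj, xj in zip(b, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def forward_select(predictors, y, min_gain=0.01):
    """Greedy forward selection using gain in r square (an assumed criterion)."""
    chosen, best_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        scored = [
            (r_squared([predictors[c] for c in chosen + [name]], y), name)
            for name in remaining
        ]
        r2, name = max(scored)
        if r2 - best_r2 < min_gain:
            break  # no remaining predictor adds enough explained variation
        chosen.append(name)
        remaining.remove(name)
        best_r2 = r2
    return chosen, best_r2


predictors = {
    "adult_literacy": [47.5, 81.2, 88.6, 95.7, 90.9, 99.8, 57.9, 96, 93.3, 91.6],
    "hospital_beds": [3, 24, 26, 23, 22, 49, 9, 47, 19, 10],
    "tb_incidence": [225, 551, 50, 15, 99, 9, 203, 18, 24, 21],
    "low_birth_weight": [30, 10, 10, 5, 6, 6, 11, 8, 7, 9],
}
life_expectancy = [63, 52, 72, 78, 73, 78, 57, 80, 78, 74]

chosen, r2 = forward_select(predictors, life_expectancy)
print("selected in order:", chosen)
print("r square:", round(r2, 3))
```

Because r square can only go up when a predictor is added, a rule like this tends to favor larger models. That is one reason real stepwise routines rely on significance tests rather than raw r square.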