
Multiple Regression

TESTS FOR PREDICTION:
The ability to choose the most appropriate statistical test(s) to perform on data depends on the type of management question you seek to answer, gain greater insight into, and/or explore. These management questions can be broadly classified into three "buckets", namely: difference, association, and prediction.
This set of notes deals with the "bucket" of prediction, and more specifically with multiple regression.
In business management, we are most often concerned with predicting the value of a dependent variable based on the value of an independent variable. The dependent variable is also referred to as the outcome, target or criterion variable, and the independent variable as the predictor, explanatory or regressor variable. A simple linear regression is also referred to as a bivariate linear regression, or simply as a linear regression, premised on the relationship being linear. While some relationships are not linear, these notes deal only with linear relationships.
The simple linear regression deals with only one dependent and one independent variable, such as the prediction of:
1. Sales based on the number of advertisements placed. The number of advertisements placed is the independent variable, with the amount of sales generated, measured on a continuous scale, being dependent. The easiest way to think about this is: the amount of sales generated is DEPENDENT on, and therefore the dependent variable of, the number of advertisements placed, the independent variable. The number of advertisements placed is INDEPENDENT of sales, i.e. not dependent on sales in the short term.
2. Employee productivity based on the amount of training received. The amount of training is the independent variable, with employee productivity the dependent variable. The way to think of this is: employee productivity is DEPENDENT on the amount of training received, making training the independent variable.
For those who confuse dependent and independent variables, the principle of CPM may assist in understanding this:
C = Cause, P = Predictor, M = Manipulated – if a variable has one of these characteristics it is the INDEPENDENT VARIABLE.
From the above it should, however, become apparent that in business one 'factor', such as the number of advertisements placed, is seldom sufficient to understand the driver of something like sales. There are usually at least a few more factors that contribute to sales. So while simple regression allows for the understanding of one variable predicting another, multiple regression allows one to build a more holistic picture of what is influencing a factor such as sales, beyond only advertisements. Thus, multiple regression allows you to predict a dependent variable (sales, fuel consumption, electricity demand, for example) based on multiple independent variables.
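In general form (a standard formulation, not specific to these notes), a multiple regression with k independent variables can be written as:

\[
Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k + \varepsilon
\]

where b_0 is the constant (intercept), each b_i is the coefficient of the independent variable X_i, and ε is the error term. Every worked equation later in these notes is an instance of this form.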
As multiple regression is an extension of simple regression, the following are basic requirements for understanding and executing a multiple regression:
1. Two or more independent variables; and
2. One dependent variable.
The independent variables can be either continuous or categorical, while the dependent variable must be continuous.
Types of business problems/scenarios solved using multiple regression:
1. Predict new values for the dependent variable given the independent variables.
a. Personnel professionals (Human Resource Practitioners) generally use multiple regression to determine equitable compensation. You can determine a number of factors or dimensions, such as "amount of responsibility" or "number of people to supervise", that you believe contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e. dimensions) for different positions. This information can be used to build a regression model, in order to understand the underlying drivers of salary in the market. For example, the model (equation) may be: Salary = 0.5(amount of responsibility) + 0.8(number of people to supervise) + 30,000. This indicates that a position with no responsibility (hardly believable) and no supervision would attract a base salary of R30,000. For each unit increase in the amount of responsibility, the salary increases by 0.5; and so on. The key here is that, assuming the two dimensions are measured on comparable scales, the biggest underlying driver of salary is the number of people supervised, as its coefficient is larger than that of the amount of responsibility.
2. To determine or understand how much variation in the dependent variable is explained by the independent variables.
a. Using the salary example above, we know, or can reasonably assume, that salary is not only determined by the two factors or dimensions of "amount of responsibility" and "number of people to supervise", but by other factors as well, such as the number of years of relevant work experience. A regression model will therefore allow us to understand how much variation is explained by the "amount of responsibility" and "number of people to supervise". In this case it may be 20%, which simply means that only 20% of the observed variation in salary is explained by the two dimensions in the equation, and 80% is explained by "other" factors. A sketch of how such a model could be fitted in code follows below.
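Although these notes demonstrate SPSS and Excel, the same model can be sketched in a few lines of code. Below is a minimal sketch in Python using statsmodels, with invented, purely illustrative data standing in for a real salary survey:

```python
# Minimal multiple regression sketch for the salary example.
# The data below are invented for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
responsibility = rng.uniform(0, 100, n)  # hypothetical "amount of responsibility" score
supervised = rng.integers(0, 20, n)      # hypothetical "number of people to supervise"
# Simulate salaries roughly following the worked equation, plus random noise
salary = 30_000 + 0.5 * responsibility + 0.8 * supervised + rng.normal(0, 5, n)

X = sm.add_constant(np.column_stack([responsibility, supervised]))
model = sm.OLS(salary, X).fit()
print(model.params)        # constant (base salary) and the two coefficients
print(model.rsquared)      # proportion of salary variation explained (R squared)
print(model.rsquared_adj)  # bias-corrected estimate (adjusted R squared)
```

Because the data are simulated from the worked equation, the printed coefficients will land near 0.5, 0.8 and 30,000; with real survey data they would, of course, be estimated from the market.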
Like all statistical tests, there are underlying assumptions that are either assumed to be true or tested. For multiple regression, the underlying assumptions are assumed to be true, particularly for this course.
All statistical tests performed using software are underpinned by specific null and alternate hypotheses. These are not those that YOU specify, but rather those that are specified and tested by the software package that is used. For the multiple linear regression the hypotheses are the following:
Null Hypothesis (H0): There is no relationship between the X variables and the Y variable.
The null hypothesis for the multiple linear regression simply states that the fit of the observed Y values to those predicted by the multiple regression is no better than you would expect by chance.
Alternate hypothesis (H1): There is a relationship between the X variables and the Y variable.
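In terms of the general equation introduced earlier, these hypotheses can be stated compactly (a standard formulation) as:

\[
H_0: b_1 = b_2 = \dots = b_k = 0 \qquad H_1: b_j \neq 0 \text{ for at least one } j
\]

This is precisely what the overall F-test reported in the ANOVA table, discussed in the interpretation sections below, evaluates.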
Using the procedure for a multiple linear regression in SPSS below, we will start gaining an understanding of the above theory.
PROCEDURE IN SPSS:
Context: A health researcher wants to be able to predict maximal aerobic capacity (VO2 max), an indicator of fitness and health. Ultimately, the researcher is trying to understand how fit and healthy executives are during the MBA. Normally, performing this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum (i.e. until they can no longer continue exercising due to physical exhaustion). This can put off those individuals who are not very active/fit and those who might be at higher risk of ill health (e.g. older, unfit subjects). For these reasons, it has been desirable to find a way of predicting an individual's VO2 max based on more easily and cheaply measured attributes. To this end, the researcher recruits 100 MBA students to perform a maximal VO2 max test, but also records their age, weight, heart rate and gender. Heart rate is the average of the last 5 minutes of a much easier, lower-workload 20-minute cycling test. The researcher's goal is to be able to predict VO2 max based on age, weight, heart rate and gender. This will allow the researcher to understand how healthy and fit executives are during the MBA.
Step 1:
Click Analyse > Regression > Linear.
You will be presented with the Linear Regression dialog box below:
[Screenshot: Linear Regression dialog box]
Step 2:
Transfer the dependent variable, VO2 Max, into the Dependent box by first clicking on the variable VO2 Max and then using the arrow button to move it across. NOTE: You can only add one variable here. You cannot predict the outcome of multiple variables at the same time. Highlight the independent variables of age, weight, heart rate and gender and move them into the Independent(s) box using the same arrow button. The below should result:
[Screenshot: Linear Regression dialog box with the dependent and independent variables assigned]
Step 3:
Click the Statistics button, and you will be presented with the dialog box below:
[Screenshot: Linear Regression: Statistics dialog box]
Ensure the following boxes are checked: Estimates (sometimes selected by default); Confidence intervals; Model fit (sometimes selected by default); and R squared change.
Click Continue.
Step 4:
Click OK to generate the output.
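For readers who prefer code to dialog boxes, an equivalent analysis can be sketched with Python's statsmodels formula interface. The file name and column names below (vo2max.csv; vo2max, age, weight, heart_rate, gender) are assumptions for illustration and must be adapted to your own dataset:

```python
# Sketch of the same regression outside SPSS; file and column names are
# hypothetical and should be adapted to your own data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("vo2max.csv")  # assumed file name
# Gender must already be numerically coded (e.g. 1 = male, 2 = female),
# mirroring the coding used later in these notes.
model = smf.ols("vo2max ~ age + weight + heart_rate + gender", data=df).fit()
# summary() reports the counterparts of SPSS's Model Summary (R, R squared,
# adjusted R squared), the ANOVA F-test and its p-value, and the Coefficients
# table (B, t, Sig., confidence intervals) in a single block.
print(model.summary())
```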
PROCEDURE IN EXCEL:
Please note that this requires the Data Analysis Add-In. I have used a different example for the screenshots below; however, the procedure is the same.
Step 1:
Click on Data > then click on Data Analysis > click on Regression. The image below should result:
[Screenshot: Regression dialog box]
In the above image, note that we are trying to predict Cars (the dependent variable) from HH size and cubed HH size (two independent variables).
Step 2:
Input the dependent variable into the Input Y Range and the two independent variables into the Input X Range. See the image below:
[Screenshot: Regression dialog box with the Y and X ranges completed]
Step 3:
Ensure that you have selected the correct output range and click OK to
generate the output.
INTERPRETING THE OUTPUT FROM SPSS:
Step 1:
The first output from SPSS lists the variables that are contained in the equation:
[SPSS output: Variables Entered/Removed table]
Ensure that SPSS has not removed any variables; if it has, establish why it has.
Step 2:
Look at the output labelled Model Summary:
[SPSS output: Model Summary table]
There are three key measures of interest in this table:
1. The "R" column represents the value of R, the multiple correlation
coefficient. When there is only one independent variable, as in simple
linear regression, R is r, the Pearson correlation coefficient. The
multiple correlation coefficient, R, generalizes the correlation
coefficient, r. R can be considered to be one measure of the quality of
the prediction of the dependent variable; in this case, VO2 Max. R is, in
fact, the correlation between the predicted scores and the actual
scores of the dependent variable. R can range in value from 0 to 1,
with higher values indicating that the predicted values are more closely
correlated to the dependent variable (i.e., the greater the value of R,
the better the independent variables are at predicting the dependent
variable). A value of 0.760, in this example, indicates a good level of
prediction.
2. The second key measures are related, and therefore dealt with together: The "R Square" column represents the R² value (also called the coefficient of determination). This represents the proportion of variance in the dependent variable that can be explained by the independent variables. You can see from our value of 0.577 that our independent variables explain 57.7% of the variability of our dependent variable, VO2 Max. However, R² is based on the sample and is considered a positively-biased estimate of the proportion of the variance of the dependent variable accounted for by the regression model (i.e. it is larger than it should be when generalizing to a larger population). "Adjusted R Square" (adj. R²) attempts to correct for this bias and thus provides smaller values, as would be expected in the population. As such, it is preferable to use this value to report the proportion of variance explained (i.e. report 55.9%, instead of 57.7%).
From the above we can already see that 55.9% of the variance in VO2 Max is explained by Gender, Age, Heart Rate and Weight.
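The adjustment applied here is the standard one:

\[
\text{adj. } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
\]

where n is the sample size and k is the number of independent variables. As a check on the example: with n = 100 and k = 4, 1 − (1 − 0.577)(99/95) ≈ 0.559, i.e. the 55.9% reported above.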
Step 3:
Look at the output labelled ANOVA.
The ANOVA table in a multiple regression indicates whether the model that YOU have proposed is a good fit for the data. As such, the most important column that one needs to understand here is the Sig. column (p-value). The general significance rule applies here. If this value is less than 0.05 (assuming we are testing at a 95% confidence level) then the model that YOU have proposed is a good fit for the data. If the Sig. column (p-value) is greater than 0.05 (assuming you are testing at a 95% confidence level) then the model you are proposing is a bad fit for the data. If this is the result, it is generally best to stop and rethink what variables other than those in this model should apply.
Step 4:
Look at the table labelled Coefficients:
[SPSS output: Coefficients table]
In this table the first column you want to look at is the B column under Unstandardised Coefficients.
NOTE: B in a multiple regression can be thought of as m, the slope, in a general straight-line equation.
From the above we can build our model/equation:
VO2 Max = 87.830 – 0.165(Age) – 0.385(Weight) – 0.118(Heart Rate) + 13.208(Gender)
From the above, the first notable aspect is that the constant is 87.830, which technically means that if all the independent variables were 0, then the VO2 max would be 87.830. In the context of the above, this is impossible, as it would mean that you do not have a person being measured; the person does not exist. However, the principle applies; the CONTEXT matters.
NOTE: As gender was coded as 1 for male and 2 for female, you would need to substitute that code into the equation, i.e. 1 for a male and 2 for a female. This indicates that, holding the other variables constant, females in the above equation would have a VO2 max 13.208 higher than males.
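To see the equation at work, here is a small worked prediction for a hypothetical person (the values are invented for illustration): a 40-year-old male (gender code 1) weighing 80 kg with an average heart rate of 140.

```python
# Worked prediction from the fitted equation; the input values are invented.
age, weight, heart_rate, gender = 40, 80, 140, 1  # hypothetical 40-year-old male
vo2max = 87.830 - 0.165 * age - 0.385 * weight - 0.118 * heart_rate + 13.208 * gender
print(round(vo2max, 3))  # prints 47.118
```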
Step 5:
While the above starts giving us an indication of the equation, we must understand whether each independent variable is a significant predictor of/contributor to the equation. Technically, we are looking at whether the independent variables' coefficients are statistically different from 0.
To understand which predictors/independent variables are significant, look at the columns labelled t and Sig. The t column gives the t-value, while the Sig. column gives us the p-value. As we can observe from the above, all independent variables are significant, i.e. all p-values are less than 0.05.
Step 6: Look at the table labelled Coefficients again, and look at the columns under 95% Confidence Interval for B, Lower Bound and Upper Bound. These simply indicate the upper and lower 95% confidence bounds for the coefficients of the independent variables.
The conclusion that we draw from the multiple regression: VO2 Max can be predicted from our independent variables of age, weight, heart rate and gender. Each is a significant predictor. The equation is: VO2 Max = 87.830 – 0.165(Age) – 0.385(Weight) – 0.118(Heart Rate) + 13.208(Gender)
INTERPRETING THE OUTPUT FROM EXCEL:
Excel provides a single table output, as indicated below. I use the output I
generated from another dataset to interpret the findings:
[Excel output: regression SUMMARY OUTPUT table]
Step 1:
Look at the Regression Statistics:
[Excel output: Regression Statistics section]
There are three key measures of interest in this section:
1. The "Multiple R" row represents the value of R, the multiple correlation coefficient. When there is only one independent variable, as in simple linear regression, R is r, the Pearson correlation coefficient. The multiple correlation coefficient, R, generalizes the correlation coefficient, r. R can be considered to be one measure of the quality of the prediction of the dependent variable; in this case, Cars. R is, in fact, the correlation between the predicted scores and the actual scores of the dependent variable. R can range in value from 0 to 1, with higher values indicating that the predicted values are more closely correlated to the dependent variable (i.e. the greater the value of R, the better the independent variables are at predicting the dependent variable). A value of approximately 0.896 (the square root of the R Square value below), in this example, indicates a good level of prediction.
2. The second key measures are related, and therefore dealt with together: The "R Square" row represents the R² value (also called the coefficient of determination). This represents the proportion of variance in the dependent variable that can be explained by the independent variables. You can see from our value of 0.802 that our independent variables explain 80.2% of the variability of our dependent variable, Cars. However, R² is based on the sample and is considered a positively-biased estimate of the proportion of the variance of the dependent variable accounted for by the regression model (i.e. it is larger than it should be when generalizing to a larger population). "Adjusted R Square" (adj. R²) attempts to correct for this bias and thus provides smaller values, as would be expected in the population. As such, it is preferable to use this value to report the proportion of variance explained (i.e. report 60.5%, instead of 80.2%).
From the above we can already see that 60.5% of the variance in Cars is explained by HH size and cubed HH size. Note that the large gap between R² (80.2%) and adjusted R² (60.5%) is itself informative: it typically indicates a small sample relative to the number of independent variables.
Step 2:
Look at the ANOVA part of the table:
[Excel output: ANOVA section]
The ANOVA table in a multiple regression indicates whether the model that YOU have proposed is a good fit for the data. As such, the most important value that one needs to understand here is the Significance F (p-value). The general significance rule applies here. If this value is less than 0.05 (assuming we are testing at a 95% confidence level) then the model that YOU have proposed is a good fit for the data. If the Significance F (p-value) is greater than 0.05 (assuming you are testing at a 95% confidence level) then the model you are proposing is a bad fit for the data. If this is the result, it is generally best to stop and rethink what variables other than those in this model should apply. THIS IS THE CASE in the above example. For illustrative purposes we continue.
Step 3:
Look at the last aspect of the output, which contains the intercept and variables.
In this section the first column you want to look at is Coefficients (the same as the B column under Unstandardised Coefficients in the SPSS output).
[Excel output: coefficients section]
From the above we can build our model/equation:
Cars = 0.896 + 0.33(HH Size) + 0.002(Cubed HH size)
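Note that "cubed HH size" is simply an extra column computed from HH size, so a curved (polynomial) relationship is fitted with exactly the same multiple regression machinery. A minimal sketch, assuming invented household data:

```python
# Fitting Cars on HH size and its cube; the data are invented for illustration.
import numpy as np
import statsmodels.api as sm

hh_size = np.array([1, 2, 2, 3, 4, 4, 5, 6], dtype=float)  # hypothetical household sizes
cars = np.array([1, 1, 2, 2, 2, 3, 3, 4], dtype=float)     # hypothetical car counts

# The cubed term is just another independent variable (column)
X = sm.add_constant(np.column_stack([hh_size, hh_size ** 3]))
model = sm.OLS(cars, X).fit()
print(model.params)    # intercept, HH size and cubed HH size coefficients
print(model.f_pvalue)  # the counterpart of Excel's Significance F
```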
Step 4:
While the above starts giving us an indication of the equation, we must understand whether each independent variable is a significant predictor of/contributor to the equation. Technically, we are looking at whether the independent variables' coefficients are statistically different from 0.
To understand which predictors/independent variables are significant, look at the columns labelled t Stat and P-value. The t Stat column gives the t-value, while the P-value column gives us the p-value. As we can observe from the below, all independent variables are NOT significant, i.e. all p-values are greater than 0.05.
[Excel output: coefficients section with t Stat and P-value columns]
Step 5:
Look at the columns labelled Lower 95% and Upper 95%. These simply indicate the upper and lower 95% confidence bounds for the coefficients of the independent variables.
The conclusion that we draw from the multiple regression: Cars cannot be
predicted from our independent variables of HH size and cubed HH size.