Research Methods & Data Analysis

Correlation and regression analysis
Week 8
Dr. Mario Mazzocchi
Lecture outline
• Correlation
• Regression Analysis
• The least squares estimation method
• SPSS and regression output
• Task overview
Correlation
• Correlation measures to what extent
two (or more) variables are related
– Correlation expresses a relationship that is
not necessarily precise (e.g. height and
weight)
– Positive correlation indicates that the two
variables move in the same direction
– Negative correlation indicates that they
move in opposite directions
Covariance
• Covariance measures the “joint
variability”
• If two variables are independent, then
the covariance is zero (however, Cov = 0
does not mean that two variables are
independent)
$\mathrm{Cov}(x, y) = \sigma_{xy} = E(xy) - E(x)E(y)$
• Where E(…) indicates the expected
value (i.e. average value)
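The covariance formula above can be tried on a few numbers. The sketch below is only an illustration on made-up data; the helper names `mean` and `covariance` are assumptions, not part of the lecture. It also shows a case where the covariance is zero although one variable is fully determined by the other.

```python
def mean(values):
    """Average value, used here as the expected value E(...)."""
    return sum(values) / len(values)

def covariance(x, y):
    """Covariance computed as E(xy) - E(x)E(y)."""
    xy = [xi * yi for xi, yi in zip(x, y)]
    return mean(xy) - mean(x) * mean(y)

# Made-up data: x is symmetric around 0
x = [-2, -1, 0, 1, 2]
y_linear = [2 * xi + 1 for xi in x]   # moves in the same direction as x
y_square = [xi ** 2 for xi in x]      # depends on x, but not linearly

cov_linear = covariance(x, y_linear)
cov_square = covariance(x, y_square)
print(cov_linear)   # 4.0
print(cov_square)   # 0.0, although y_square is not independent of x
```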
Correlation coefficient
• The correlation coefficient r gives a
measure (in the range from –1 to +1) of the
linear relationship between two variables
– r=0 means no linear correlation
– r=+1 means perfect positive correlation
– r=-1 means perfect negative correlation
• Perfect correlation indicates that all
observations lie exactly on a straight line, so a
given change in x always corresponds to a
proportional change in y
Correlation coefficient
and covariance
Pearson correlation coefficient:
$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$
Correlation coefficient - POPULATION:
$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
SAMPLE:
$r = \frac{s_{xy}}{s_x s_y}$, where $s_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i}{n}\,\frac{\sum_{i=1}^{n} y_i}{n}$
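As a rough illustration of the sample formula, the sketch below computes r from the sample covariance and standard deviations, all divided by n. The height and weight figures are hypothetical, and the function name `pearson_r` is an assumption made for this example.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Sample correlation r = s_xy / (s_x * s_y), with all moments divided by n."""
    mx, my = mean(x), mean(y)
    s_xy = mean([xi * yi for xi, yi in zip(x, y)]) - mx * my
    s_x = math.sqrt(mean([xi ** 2 for xi in x]) - mx ** 2)
    s_y = math.sqrt(mean([yi ** 2 for yi in y]) - my ** 2)
    return s_xy / (s_x * s_y)

# Hypothetical heights (cm) and weights (kg) of five people
height = [160, 165, 170, 175, 180]
weight = [55, 62, 63, 72, 78]

r = pearson_r(height, weight)
print(round(r, 2))   # about 0.98: strong, but not perfect, positive correlation
```

Swapping `weight` for a list that falls as height rises would give a negative r.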
Bivariate and multivariate
correlation
• Bivariate correlation
– 2 variables
– Pearson correlation coefficient
• Partial correlation
– The correlation between two variables after
allowing for the effect of other “control”
variables
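For a single control variable z, the partial correlation can be obtained from the three bivariate correlations. The sketch below uses this standard first-order formula with hypothetical input values; it is a simplification, since with several control variables the same idea has to be applied repeatedly (or done by the software).

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation between x and y after allowing for one control variable z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical correlations: x and y are both related to the control variable z
r_partial = partial_corr(r_xy=0.60, r_xz=0.50, r_yz=0.50)
print(round(r_partial, 3))   # 0.467, lower than the bivariate 0.60

# If the control variable is unrelated to both x and y, nothing changes
print(partial_corr(r_xy=0.60, r_xz=0.0, r_yz=0.0))   # 0.6
```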
Significance level in
correlation
• Level of correlation (value of the correlation coefficient):
indicates to what extent the two variables “move
together”
• Significance of correlation (p value): given that the
correlation coefficient is computed on a sample,
indicates whether the relationship appears to be
statistically significant
• Examples
– Correlation is 0.50, but not significant: the sampling error is so
high that the actual correlation could even be 0
– Correlation is 0.10 and highly significant: the level of correlation
is very low, but we can be confident that the actual correlation
is different from 0
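The two examples can be made concrete with the usual t statistic for testing whether a correlation differs from 0, t = r·sqrt(n − 2) / sqrt(1 − r²). The sample sizes below (10 and 779) are assumptions chosen for illustration, and the critical values quoted in the comments are approximate two-tailed 5% values from Student's t tables.

```python
import math

def t_statistic(r, n):
    """t statistic for H0: the true correlation is 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_small = t_statistic(0.50, 10)    # high correlation, hypothetical sample of 10
t_large = t_statistic(0.10, 779)   # low correlation, hypothetical sample of 779

# Approximate two-tailed 5% critical values: 2.31 (8 d.f.) and 1.96 (777 d.f.)
print(round(t_small, 2), t_small > 2.31)   # not significant
print(round(t_large, 2), t_large > 1.96)   # significant
```

The same value of r can therefore be significant or not depending on how many observations it is based on.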
Correlation and
covariance in SPSS
Choose
between
bivariate &
partial
Bivariate correlation
Select the variables
you want to analyse
Request the
significance level
(two-tailed)
Ask for additional
statistics (if
necessary)
Bivariate correlation
output
Correlations

| | | Shopping style | Use coupons | Amount spent |
|---|---|---|---|---|
| Shopping style | Pearson Correlation | 1 | .157** | .159** |
| | Sig. (2-tailed) | . | .000 | .000 |
| | N | 779 | 779 | 779 |
| Use coupons | Pearson Correlation | .157** | 1 | .291** |
| | Sig. (2-tailed) | .000 | . | .000 |
| | N | 779 | 779 | 779 |
| Amount spent | Pearson Correlation | .159** | .291** | 1 |
| | Sig. (2-tailed) | .000 | .000 | . |
| | N | 779 | 779 | 779 |

**. Correlation is significant at the 0.01 level (2-tailed).
Partial correlations
List of
variables to be
analysed
Control
variables
Partial correlation output
- - - P A R T I A L   C O R R E L A T I O N   C O E F F I C I E N T S - - -
Controlling for.. SIZE STYLE

| | AMTSPENT | USECOUP | ORG |
|---|---|---|---|
| AMTSPENT | 1.0000 | .2677 | -.0116 |
| | ( 0) | ( 775) | ( 775) |
| | P= . | P= .000 | P= .746 |
| USECOUP | .2677 | 1.0000 | .0500 |
| | ( 775) | ( 0) | ( 775) |
| | P= .000 | P= . | P= .164 |
| ORG | -.0116 | .0500 | 1.0000 |
| | ( 775) | ( 775) | ( 0) |
| | P= .746 | P= .164 | P= . |

(Coefficient / (D.F.) / 2-tailed Significance)
" . " is printed if a coefficient cannot be computed

Partial correlations still measure the correlation between two variables, but
eliminate the effect of other variables, i.e. the correlations are computed on
consumers shopping in stores of identical size and with the same shopping style
Bivariate and partial
correlations
• Correlation between Amount spent and Use
of coupons
– Bivariate correlation: 0.291 (p value < 0.001)
– Partial correlation: 0.268 (p value < 0.001)
• The amount spent is positively correlated
with the use of coupons (0=no use, 1=from
newspaper, 2=from mailing, 3=both)
• The level of correlation does not change much
after accounting for different shop sizes and
shopping styles
Linear regression analysis
$y_i = \alpha + \beta x_i + \varepsilon_i$
– $y_i$: dependent variable
– $\alpha$: intercept
– $\beta$: regression coefficient
– $x_i$: independent variable (explanatory variable, regressor…)
– $\varepsilon_i$: error
Regression analysis
[Figure: scatter plot of Cholesterol (mg/100 ml), on the y axis, against Age, on the x axis]
Example
• We want to investigate if there is
a relationship between cholesterol
and age on a sample of 18 people
• The dependent variable is the
cholesterol level
• The explanatory variable is age
What regression analysis
does
• Determines whether a relationship exists
between the dependent and explanatory
variables
• Determines how much of the variation in
the dependent variable is explained by
the independent variable (goodness of
fit)
• Allows the values of the dependent
variable to be predicted
Regression and
correlation
• Correlation: there is no causal
relationship assumed
• Regression: we assume that the
explanatory variables “cause” the
dependent variable
– Bivariate: one explanatory variable
– Multivariate: two or more explanatory
variables
How to estimate the
regression coefficients
• The objective is to estimate the population
parameters $\alpha$ and $\beta$ on our data sample:
$y_i = a + b x_i + e_i$
• A good way to estimate them is by minimising the
error $e_i$, which represents the difference
between the actual observation and the
estimated (predicted) one
[Figure: linear regression fit line on the scatter plot of Cholesterol (mg/100 ml) against Age]
Cholesterol (mg/100 ml) = 140.36 + 4.58 * age
R-Square = 0.65
The objective is to identify the line (i.e. the a and b coefficients) that
minimises the distance between the actual points and the fit line
The least squares method
• This is based on minimising the sum of the squared
distances (errors) rather than the distances
themselves
$b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \frac{s_{xy}}{s_x^2} = r\,\frac{s_y}{s_x}$
$a = \bar{y} - b\bar{x}$
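A minimal sketch of these two formulas follows. The ages and cholesterol levels are hypothetical and are not the lecture's sample of 18 people; the function name `least_squares` is likewise an assumption.

```python
def mean(values):
    return sum(values) / len(values)

def least_squares(x, y):
    """Return (a, b) of the fitted line y = a + b*x, with b = Cov(x, y) / Var(x)."""
    mx, my = mean(x), mean(y)
    s_xy = mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])
    s_xx = mean([(xi - mx) ** 2 for xi in x])
    b = s_xy / s_xx
    a = my - b * mx          # the line passes through the point of means
    return a, b

# Hypothetical data: age (years) and cholesterol (mg/100 ml)
age = [25, 35, 45, 55, 65]
cholesterol = [250, 300, 340, 400, 430]

a, b = least_squares(age, cholesterol)
print(round(a, 1), round(b, 2))   # 137.0 4.6
```

With these made-up numbers, each additional year of age is associated with about 4.6 mg/100 ml more cholesterol.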
Bivariate regression in
SPSS
Regression dialog box
Dependent
variable
Explanatory
variable
Leave this
unchanged!
Regression output
Coefficients (a)

| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 140.359 | 34.715 | | 4.043 | .001 |
| Age | 4.577 | .838 | .807 | 5.464 | .000 |

a. Dependent Variable: Cholesterol (mg/100 ml)

– Column B: value of the coefficients
– Column Sig.: statistical significance (is the coefficient different from 0?)
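The t column of the output can be checked by hand, since each t value is the coefficient divided by its standard error. The short sketch below does this with the figures from the Coefficients table; the variable names are assumptions, and small differences from the printed t values come from rounding in the table.

```python
# (B, Std. Error) pairs copied from the Coefficients table
estimates = {"(Constant)": (140.359, 34.715), "Age": (4.577, 0.838)}

# t = B / Std. Error tests whether the coefficient is different from 0
t_values = {name: b / se for name, (b, se) in estimates.items()}

for name, t in t_values.items():
    print(name, round(t, 2))   # (Constant) 4.04, Age 5.46
```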
Model diagnostics:
goodness of fit
Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .807 (a) | .651 | .629 | 45.218 |

a. Predictors: (Constant), Age

The value of R-square lies between 0 and 1 and represents the proportion
of total variation that is explained by the regression model
R-square
$SS_y = SS_{reg} + SS_{res}$
(total variation = variation explained by regression + residual variation)
$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$R^2 = \frac{SS_{reg}}{SS_y}$
where $\hat{y}_i = a + b x_i$
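The decomposition can be verified numerically. The sketch below fits a line to hypothetical age and cholesterol figures (not the lecture's data), computes the three sums of squares and checks that the explained and residual parts add up to the total.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical data: age (x) and cholesterol (y)
x = [25, 35, 45, 55, 65]
y = [250, 300, 340, 400, 430]

# Least squares coefficients and predicted values y_hat = a + b*x
mx, my = mean(x), mean(y)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
y_hat = [a + b * xi for xi in x]

ss_y = sum((yi - my) ** 2 for yi in y)                     # total variation
ss_reg = sum((fi - my) ** 2 for fi in y_hat)               # explained by regression
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual variation

r_square = ss_reg / ss_y
print(round(ss_y), round(ss_reg), round(ss_res))   # 21320 21160 160
print(round(r_square, 3))
```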
Multivariate regression
• The principle is identical to bivariate
regression, but there are more explanatory
variables
• The goodness of fit can be measured through
the adjusted R-square, which takes into
account the number of explanatory variables
$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_n x_{ni} + e_i$
Multivariate regression in
SPSS
• Analyze / Regression / Linear
Simply select
more than one
explanatory
variable
Output
Coefficients (a)

| Model 1 | Unstandardized B | Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 296.482 | 19.792 | | 14.980 | .000 |
| Health food store | 9.721 | 15.012 | .024 | .648 | .517 |
| Size of store | 9.753 | 6.070 | .059 | 1.607 | .109 |
| Gender | -69.598 | 7.483 | -.302 | -9.301 | .000 |
| Vegetarian | -1.910 | 12.570 | -.005 | -.152 | .879 |
| Shopping style | 22.760 | 6.069 | .123 | 3.750 | .000 |
| Use coupons | 30.417 | 3.512 | .285 | 8.662 | .000 |

a. Dependent Variable: Amount spent
Coefficient interpretation
• The constant represents the amount spent when all other variables are 0
(£ 296.5)
• The coefficients for Health food store, Size of store and Vegetarian are not
significantly different from 0
• Gender coeff = -69.6: on average, being a woman (G=1) implies spending
£ 69.6 less
• Shopping style coeff = +22.8 S
– S=1 (shops for himself) = +22.8
– S=2 (shops for himself & spouse) = +45.6
– S=3 (shops for himself & family) = +68.4
• Coupon use coeff = 30.4 C
– C=1 (do not use coupon) = +30.4
– C=2 (coupon from newspapers) = +60.8
– C=3 (coupon from mailings) = +91.2
– C=4 (coupon from both) = +121.6
• Categorization problems?
Prediction
• On average, how much will someone
with the following characteristics spend:
– Male (G=0)
– Shopping for family (S=3)
– Not using coupons (C=1)
$\mathrm{AMT} = 296.5 - 69.6\,G + 22.8\,S + 30.4\,C = 395.3$
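The prediction can be reproduced in a few lines. The sketch below uses the rounded coefficients of the significant variables from the regression output; the function and dictionary names are assumptions made for this example.

```python
# Rounded coefficients from the regression output (significant terms only)
COEFFICIENTS = {"constant": 296.5, "gender": -69.6, "shopping_style": 22.8, "coupon": 30.4}

def predict_amount(gender, shopping_style, coupon):
    """Predicted amount spent (in pounds) for given values of G, S and C."""
    c = COEFFICIENTS
    return (c["constant"]
            + c["gender"] * gender
            + c["shopping_style"] * shopping_style
            + c["coupon"] * coupon)

# Male (G=0), shopping for family (S=3), not using coupons (C=1)
amount = predict_amount(gender=0, shopping_style=3, coupon=1)
print(round(amount, 1))   # 395.3
```

Changing `gender` to 1 with the other values unchanged lowers the prediction by 69.6.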
How good is the model?
Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .439 (a) | .193 | .187 | 104.08167 |

a. Predictors: (Constant), Use coupons, Vegetarian,
Gender, Health food store, Shopping style, Size of store

• The regression model explains only about 19% of the
total variation in the amount spent (adjusted R-square: 18.7%)
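The adjusted R-square in the two model summaries can be reproduced with the usual correction, 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of explanatory variables. In the sketch below, n = 18 comes from the cholesterol example, while n = 779 for the amount spent model is an assumption taken from the N reported in the correlation output.

```python
def adjusted_r_square(r_square, n, k):
    """Adjusted R-square for n observations and k explanatory variables."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

# Cholesterol on age: R-square .651, 18 people, 1 explanatory variable
adj_bivariate = adjusted_r_square(0.651, n=18, k=1)

# Amount spent: R-square .193, 6 explanatory variables, n = 779 (assumed)
adj_multi = adjusted_r_square(0.193, n=779, k=6)

print(round(adj_bivariate, 3))   # 0.629
print(round(adj_multi, 3))       # 0.187
```

The penalty is larger when the sample is small relative to the number of explanatory variables, which is why the adjustment matters more in the 18-person example.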
Task A
• Examine the relationship between the
amount spent and the following
customer characteristics:
– Being male/female
– Being vegetarian
– Shopping for himself / for himself and others
– Shopping style (weekly, bi-weekly, etc.)
Potential methods:
• Battery of hypothesis testing & Analysis of variance
• Regression Analysis
Task B
• Examine the relationship between the amount
spent and the following customer
characteristics:
– Hypothesis: the average amount spent in health-oriented shops is higher than that in other shops.
True or false?
– Test the same hypothesis accounting for different
shop sizes
Potential methods:
• Battery of hypothesis testing & Analysis of variance
• Regression Analysis
Task C
• Find a relationship between the average
amount spent per store and the following
store characteristics:
– Size of store
– Health-oriented store
– Store organisation
Potential methods:
• Transform the customer data set into a store data set
• Battery of ANOVA
• Regression Analysis
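The first step, turning the customer data set into a store data set, amounts to averaging the amount spent within each store. The sketch below shows the idea on a handful of hypothetical records; the store identifiers and amounts are made up, and in practice the same aggregation would be done inside the statistical package.

```python
from collections import defaultdict

# Hypothetical customer records: (store id, amount spent)
customers = [("A", 100.0), ("A", 140.0), ("B", 80.0), ("B", 120.0), ("B", 100.0)]

# store id -> [sum of amounts, number of customers]
totals = defaultdict(lambda: [0.0, 0])
for store, amount in customers:
    totals[store][0] += amount
    totals[store][1] += 1

# One row per store: the average amount spent
store_means = {store: total / count for store, (total, count) in totals.items()}
print(store_means)   # {'A': 120.0, 'B': 100.0}
```

The resulting store-level averages can then be related to store characteristics such as size or organisation.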
Task D
• Hypothesis: is the amount spent by those who use coupons
significantly higher?
• What is the most effective way of distributing coupons:
– By mail
– In newspapers
– Both
Potential methods:
• Recode the variable into 1=not using coupons and 2=using
coupons
• Hypothesis testing
• Analysis of variance