Session 9: Data Analysis II
MKTG 3010 Marketing Research
Marketing Research Process
Step 1: Defining the Problem
Step 2: Developing an Approach to the Problem
Step 3: Formulating a Research Design
Step 4: Doing Field Work or Collecting Data
Step 5: Preparing and Analyzing Data
Step 6: Preparing and Presenting the Report
Data Analysis - Summary

One Variable — Basic Data Analysis (summary statistics, histogram)
 Q1. % who had seen a movie in the last week
 Q2. Distribution of # of movies seen on TV since end of Fall

Two Variables — Comparisons (joint distribution/pivot table, t-tests)
 Q3. Who saw a movie last week? (% male, % female)
 Q4. Does the average number of TV movies seen differ for men vs. women?
 Q5. Is the average rating of Dramas different from the average rating of Mysteries?

Multiple Variables — Relationships (correlation, linear regression)
 Q6. What predicts intention to go see “All About Steve”?
Interval or Ratio Scales: Comparing Means

 Need the sample size, mean, and standard deviation.
 Intuition: Is 2.5 different from 3.5? It depends on how spread out the responses are around each mean.

[Illustration: two sets of 1-5 rating scales — “Yes,” the means differ, when responses cluster tightly around each mean; “No” when they are spread across the whole scale]
Comparing Two Means

 Independent means: two different groups of people
 Who saw more movies, men or women?

 Dependent means: two questions among the same people
 More movies on TV or in the theater?
Comparing Independent Means
TV Movies Seen x Gender

 Does the average number of TV movies seen differ for men vs. women?
 Men = 3.0
 Women = 2.0
Independent Samples => t-test

t = ( X̄₁ − X̄₂ ) / ( Sp · √( 1/n₁ + 1/n₂ ) )

where Sp is the pooled standard deviation:

Sp² = [ (n₁ − 1)·S₁² + (n₂ − 1)·S₂² ] / ( n₁ + n₂ − 2 )
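The pooled-variance formula above can be checked against a library implementation. A minimal sketch in Python (illustrative data only, chosen so the group means match the slide's 3.0 vs. 2.0; scipy assumed available):

```python
import numpy as np
from scipy import stats

# Hypothetical counts of TV movies seen (NOT the class dataset)
men = [5, 1, 4, 2, 3, 0, 6, 3]      # mean = 3.0
women = [2, 1, 3, 0, 2, 4, 1, 3]    # mean = 2.0

def pooled_t(x, y):
    """t statistic using the pooled-variance formula from the slide."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    return (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

t_manual = pooled_t(men, women)
# scipy's equal-variance t-test uses the same pooled formula
t_scipy, p_two_tail = stats.ttest_ind(men, women, equal_var=True)
# Excel's "Two-Sample Assuming Unequal Variances" corresponds to equal_var=False (Welch)
```

The two t statistics agree exactly, which is a useful sanity check when moving between Excel and code.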
Comparing Independent Means
TV Movies Seen x Gender

[Excel screenshots of setting up and running the two-sample t-test omitted]

t-Test: Two-Sample Assuming Unequal Variances

                               Variable 1   Variable 2
Mean                              3.000        2.043
Variance                          8.057        5.172
Observations                     36           47
Hypothesized Mean Difference      0
df                               66
t Stat                            1.657
P(T<=t) one-tail                  0.051
t Critical one-tail               1.668
P(T<=t) two-tail                  0.102
t Critical two-tail               1.997

Two-tailed p = .102: about a 10% chance of a result this extreme if there were really no difference (should not reject the null).
Interpretation of P-value < .05

 AKA:
 Type I error
 Alpha error
 “Sending an innocent man up the river”

 What does .05 mean? On average, 1 in 20 times, when you say “the average differs for men and women,” you will be wrong.

 P-value = .102 for the two-tailed test
 10.2% chance of getting this extreme a result when really there is no difference.
 P-value = .051 for the one-tailed test
 5.1% chance of getting this low a result when really there is no difference.

 Fail to reject H0, at the 95% confidence level.
 “The difference by gender is not statistically significant.”
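The one- and two-tailed p-values above come straight from the t distribution; a quick check in Python (scipy assumed; t Stat and df taken from the Excel output):

```python
from scipy import stats

t_stat, df = 1.657, 66          # from the Excel t-test output
p_one = stats.t.sf(t_stat, df)  # one-tail: P(T >= t)
p_two = 2 * p_one               # t is symmetric, so the two-tail p doubles it
# p_one comes out near .051 and p_two near .102, matching the output above
```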
Comparing Dependent Means
Liking of Drama (V32) vs. Liking of Mysteries (V33)

 Q5: Is the average rating of Drama different from the average rating of Mysteries?
 Drama = 4.0
 Mystery = 3.5

 Calculate a “difference score” for each person: D = V33 − V32
 Dependent-samples t-test with H0: D = 0
Comparing Dependent Means
Drama vs. Mystery ratings

t-Test: Paired Two Sample for Means

                               Variable 1   Variable 2
Mean                              3.964        3.542
Variance                          0.767        0.861
Observations                     83           83
Pearson Correlation               0.355
Hypothesized Mean Difference      0
df                               82
t Stat                            3.746
P(T<=t) one-tail                  0.000
t Critical one-tail               1.664
P(T<=t) two-tail                  0.000
t Critical two-tail               1.989

Less than a 0.1% chance of a difference this large if there were really no difference (should reject the null).
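The dependent-samples t-test is exactly a one-sample t-test on the difference scores D, which a short Python sketch can confirm (ratings are made up for illustration, not the class data):

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 ratings from the same six respondents
drama = np.array([4, 5, 4, 3, 5, 4])
mystery = np.array([3, 4, 4, 3, 4, 3])

# Paired t-test, as Excel's "Paired Two Sample for Means"
t_paired, p_paired = stats.ttest_rel(drama, mystery)

# Same test by hand: one-sample t on D = drama - mystery with H0: D = 0
d = drama - mystery
t_diff, p_diff = stats.ttest_1samp(d, 0)
```

The two calls return identical statistics, which is why "compute D per person, then test D against 0" is a valid recipe.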
Correlation: Comparable Measure of the Relationship Between Two Variables

 Correlation is a ratio:

 r = (amount that the two variables actually do co-vary) / (maximum amount that the two variables could co-vary)

 r = COVARIANCE / (PRODUCT OF STANDARD DEVIATIONS)

 It ranges from -1 to 1:
 1 and -1 are perfect positive and negative correlation
 0 is no correlation at all
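The ratio definition above — covariance over the product of the standard deviations — is exactly what `np.corrcoef` computes. A sketch with made-up income/trips numbers:

```python
import numpy as np

# Hypothetical income (in $10K) and number of trips
income = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 11.0])
trips = np.array([1.0, 3.0, 2.0, 5.0, 6.0, 8.0])

cov = np.cov(income, trips, ddof=1)[0, 1]          # how much they actually co-vary
max_cov = income.std(ddof=1) * trips.std(ddof=1)   # the most they could co-vary
r_manual = cov / max_cov

r_numpy = np.corrcoef(income, trips)[0, 1]         # same number
```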
Perfect Correlation

[Two scatter plots of Trips vs. Income ($10K): Perfect Positive Correlation (Correlation = 1) and Perfect Negative Correlation (Correlation = -1), with every point falling exactly on the line]
High Correlation

[Scatter plot of Trips vs. Income ($10K): High Positive Correlation (Correlation = 0.93), points tightly clustered around an upward-sloping line]
Lower Correlation

[Two scatter plots of Trips vs. Income ($10K): Lower Positive Correlation (Correlation = 0.5) and Lower Negative Correlation (Correlation = -0.5), points loosely scattered around the trend]
No Correlation

[Scatter plot of Trips vs. Income ($10K): Correlation = 0, no visible pattern]
Misinterpreting No Correlation

 Correlation is a measure of linear association only.

[Scatter plot of Trips vs. Income ($10K): a clear U-shaped pattern, yet Correlation = 0]

 Correlation = 0. No relationship? No — a strong nonlinear relationship can still produce zero correlation.
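The U-shaped pattern on this slide can be reproduced in a couple of lines of Python with synthetic data: a perfect quadratic relationship whose correlation is exactly zero.

```python
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2            # y is completely determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
# x is symmetric around 0, so cov(x, y) = 0 and r = 0,
# even though knowing x tells you y exactly
```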
Factors Impacting Correlations

 Correlation reflects both the slope of the relationship and the spread of points around it.

[Four panels labeled High, Lower, Lower, Lowest: steeper slope and tighter spread yield higher correlations]
Correlations:
Not a substitute for looking at the data

[Anscombe’s quartet: four scatter plots with identical summary statistics — Mean = 7.5, Variance = 4.12, Correlation = .81 — but strikingly different patterns]

Anscombe, American Statistician 1973
Correlations:
Imperfect Measure of Relationships

[Gallery of scatter plots with their correlation coefficients — Source: Wikipedia]
Regression Analysis

 Objective
 Quantify the relationship between a criterion (dependent) variable and one or more predictor (independent) variables

 Uses
 Understanding how a predictor variable influences the dependent variable
 Predicting the dependent variable based on specified values of the predictor variables
 Forecasting how the dependent variable changes when the predictor variables change

 Examples
 Sales = f(prices, promotions, …)
 Satisfaction = f(service, tenure, segment, …)
The Regression Equation (True Model)

Y = β0 + β1 X1 + e

 Y: dependent variable (Satisfaction)
 X1: independent variable (Service)
 β0: constant (intercept)
 β1: coefficient of the independent variable (slope)
 e: error
Regression Assumptions:

 Relationship between variables is linear
 Errors are normally distributed, uncorrelated
 Amount of error is the same at every point on the line
How Helpful is the Regression?

 Recall that the sum of squared errors is a measure of inaccuracy, and that smaller is better.
 What % of the total sum of squared error is the sum of squared error associated with the regression?
 R² is the amount of variance in the dependent variable explained by the independent variable (correlation squared).
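For a one-predictor regression, `scipy.stats.linregress` returns the slope, intercept, and r directly, and R² is literally the correlation squared. A sketch with toy data:

```python
import numpy as np
from scipy import stats

# Toy data: roughly linear with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

res = stats.linregress(x, y)
r_squared = res.rvalue ** 2      # % of variance in y explained by x

# R² equals the squared Pearson correlation of x and y
r = np.corrcoef(x, y)[0, 1]
```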
Application of Regression:
Predicting “All About Steve”

• Q6: What predicts intention to go see “All About Steve”? (V89)

#   Answer                     Response    %
5   Very Likely                    1       2%
4   Likely                         2       5%
3   Undecided or Don't Know       23      52%
2   Unlikely                       7      16%
1   Very Unlikely                 11      25%

• Frequency of movie-viewing?
 V20: On TV
 V22: In a theater
 V24: On rented DVD
 V26: On owned DVD
What Would Be A Good Predictor?

 Excel: Correlation

[Excel screenshots of computing the correlation matrix omitted]
Correlations: V89, V20, V22, V24, V26

                      V89      V20      V22      V24      V26
V89                  1
V20 (TV movies)      0.099    1.000
V22 (Movie theater)  0.298    0.152    1.000
V24 (Rented)         0.095   -0.026    0.127    1.000
V26 (Owned)          0.217    0.275    0.140    0.184    1.000
Potential Predictor
V22: Movies seen in a movie theater

[Scatter plot of V89 vs. V22: a fairly poor fit, with one influential observation; Pearson correlation of V22 and V89 = .30]

First Regression: V89 ~ V22
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.298
R Square              0.089
Adjusted R Square     0.067
Standard Error        0.964
Observations         44

R-squared: measure of model fit (what % of variance in the data is explained by the model?) = Correlation² = .3 × .3 ≈ .09

ANOVA
             df     SS       MS      F       Significance F
Regression    1     3.795    3.795   4.087   0.050
Residual     42    39.000    0.929
Total        43    42.795

Significance test: is the variance explained > 0? Compare the errors when predicting from the mean to the errors from the model.

             Coefficients   Standard Error   t Stat    P-value   Lower 95%
Intercept        2.196          0.186        11.793     0.000      1.820
V22              0.105          0.052         2.022     0.050      0.000

Significance test: are the model terms ≠ 0?

The regression equation is:
V89 = 2.196 + 0.105 V22
Graphic representation

The regression equation is:
V89 = 2.196 + 0.105 V22

[V22 Line Fit Plot: observed V89 and Predicted V89 plotted against V22, from 0 to 20]
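Plugging a value into the fitted line gives a prediction; for example, for a respondent who saw 5 movies in a theater (coefficients copied from the regression output above):

```python
intercept, slope = 2.196, 0.105   # from the Excel regression output

def predict_v89(v22):
    """Predicted intention-to-see score for a given theater-movie count."""
    return intercept + slope * v22

pred = predict_v89(5)   # 2.196 + 0.105 * 5 = 2.721
```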
Multiple Regression

 Effect of each variable controlling for the other variables in the model:
 If predictors are uncorrelated, same as correlation
 If predictors are moderately correlated, useful to know
 If predictors are highly correlated, they are redundant

 Multicollinearity → estimates are nonsensical; use correlations or form an index instead
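A sketch of multiple regression by least squares (Python with numpy, synthetic data — n = 44 chosen to echo the class sample size). It also demonstrates why R² alone can't tell you whether a new variable helps: in-sample R² can only go up when a predictor is added.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 44
x1 = rng.normal(size=n)                    # stand-in predictor (e.g., theater movies)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # highly correlated second predictor
y = 2.0 + 0.5 * x1 + rng.normal(size=n)   # y really depends only on x1

def r_squared(predictors, y):
    """R² from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_one = r_squared([x1], y)
r2_two = r_squared([x1, x2], y)   # adding x2 can only raise in-sample R²
```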
What About Other Predictors?
Watching Owned Movies (V26)

Second Regression: V89 ~ V22 + V26

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.347
R Square              0.120
Adjusted R Square     0.077
Standard Error        0.958
Observations         44

ANOVA
             df     SS       MS      F       Significance F
Regression    2     5.145    2.573   2.802   0.072
Residual     41    37.650    0.918
Total        43    42.795

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        2.100          0.201        10.428     0.000      1.694       2.507
V22              0.096          0.052         1.844     0.072     -0.009       0.201
V26              0.064          0.053         1.213     0.232     -0.043       0.172

R-squared is better… but the predictors are not significant and the overall F-test is weaker.
Conclusion: V26 does not help — take it out.
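Adjusted R² is what penalizes the extra predictor here. Recomputing it from the deck's own numbers with the standard formula, 1 − (1 − R²)(n − 1)/(n − k − 1), reproduces the values Excel reports:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# First regression, V89 ~ V22: R² = .089, n = 44, k = 1
adj1 = adjusted_r2(0.089, 44, 1)   # comes out near the reported 0.067
# Second regression, V89 ~ V22 + V26: R² = .120, k = 2
adj2 = adjusted_r2(0.120, 44, 2)   # comes out near the reported 0.077
```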
How About Other Predictors?
V37: Romantic Comedy Liking

Third Regression: V89 ~ V22 + V37

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.387
R Square              0.150
Adjusted R Square     0.108
Standard Error        0.942
Observations         44

ANOVA
             df     SS       MS      F       Significance F
Regression    2     6.402    3.201   3.606   0.036
Residual     41    36.393    0.888
Total        43    42.795

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        3.148          0.585         5.385     0.000      1.968       4.329
V22              0.101          0.051         1.994     0.053     -0.001       0.203
V37             -0.253          0.148        -1.714     0.094     -0.552       0.045

R-squared is a lot better… the overall F-test is better, and V22 & V37 are near significant.
Even controlling for attitude toward romantic comedies, there is a positive effect of seeing more movies.
…but what about gender?
Final Model
V22, V37 and Gender

Final Regression: V89 ~ V22 + V37 + V105

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.458
R Square              0.210
Adjusted R Square     0.151
Standard Error        0.919
Observations         44

ANOVA
             df     SS       MS      F       Significance F
Regression    3     8.985    2.995   3.543   0.023
Residual     40    33.810    0.845
Total        43    42.795

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept        2.862          0.593         4.823     0.000      1.663       4.062
V22              0.117          0.050         2.330     0.025      0.016       0.219
V37             -0.434          0.177        -2.446     0.019     -0.792      -0.075
V105             0.606          0.347         1.748     0.088     -0.095       1.307

Best model. Why not include V26? To avoid OVERFITTING: maximizing prediction in sample reduces prediction out of sample.
Cautions About Regression:
Model Building

 Overfitting
 Overly complex models fit “noise” in the data rather than generalizable patterns

 Omitted variable / mis-specification bias
 Interpretation of coefficients can be wrong when key variables are left out

 Causality assumption
 Regression is relational; you can’t assume the predictors cause the dependent variable

 And more…
 Endogeneity, heterogeneity, …
Why Model?

1. Explanatory: to understand the relationships among variables in a process.
 Validating survey measures. How does intent relate to sales?
 Testing hypotheses. Is ad recall related to purchase intent?
 Assessing relationships. Which attributes are most closely related to favorability?
 Variable meaning / model specification important.

Wyner, “Why Model?” Marketing Research 2006
Why Model?

2. Predictive: to make predictions based on the available data.
 Identifying high-value prospects. Which of these variables are predictive of responding to an offer? What is the probability that a given person responds?
 Predicting an outcome. What are the predicted sales of a new product, based on an analysis of past products?
 Variable specification and causality assumptions are less of an issue (“proxy” variables OK). Caution: overfitting, unusual cases!

Wyner, “Why Model?” Marketing Research 2006
Why Model?

3. Decision Support: to quantify the relative consequences of management actions.
 What if…? What if we change our pricing, media mix, promotion strategy, etc.?
 Product design. Which combination of features would yield the most attractive product?
 Strong causality assumptions; model based on “levers” (even if the effect is small); must “control for” other factors (omitted variable bias). Applicable in the observed range only.

Wyner, “Why Model?” Marketing Research 2006
Data Analysis - Summary

One Variable — Basic Data Analysis (summary statistics, histogram)
 Q1. % who had seen a movie in the last week
 Q2. Distribution of # of movies seen on TV since end of Fall

Two Variables — Comparisons (joint distribution/pivot table, t-tests)
 Q3. Who saw a movie last week? (% male, % female)
 Q4. Does the average number of TV movies seen differ for men vs. women?
 Q5. Is the average rating of Dramas different from the average rating of Mysteries?

Multiple Variables — Relationships (correlation, linear regression)
 Q6. What predicts intention to go see “All About Steve”?
“David McCandless: The Beauty of Data Visualization”
Data Visualization

Edward Tufte is an American statistician and professor emeritus of political science, statistics, and computer science at Yale University. He is noted for his writings on information design and as a pioneer in the field of data visualization.
The Minard Map: “The best statistical graphic ever drawn”

Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian campaign of 1812. Beginning at the Polish-Russian border, the thick band shows the size of the army at each position. The path of Napoleon's retreat from Moscow in the bitterly cold winter is depicted by the dark lower band, which is tied to temperature and time scales.
A picture is worth a thousand words

As a statistical chart, the map unites six different sets of data:

 Geography: rivers, cities and battles are named and placed according to their occurrence on a regular map.
 The army’s course: the path’s flow follows the way in and out that Napoleon followed.
 The army’s direction: indicated by the colour of the path, gold leading into Russia, black leading out of it.
 The number of soldiers remaining: the path gets successively narrower, a plain reminder of the campaign’s human toll, as each millimetre represents 10,000 men.
 Temperature: the freezing cold of the Russian winter on the return trip is indicated at the bottom, in the republican measurement of degrees of Réaumur (water freezes at 0° Réaumur, boils at 80° Réaumur).
 Time: in relation to the temperature indicated at the bottom, from right to left, starting 24 October (pluie, i.e. ‘rain’) to 7 December (−27°).