Session 9. Data Analysis II
MKTG 3010 MARKETING RESEARCH

Marketing Research Process
Step 1: Defining the Problem
Step 2: Developing an Approach to the Problem
Step 3: Formulating a Research Design
Step 4: Doing Field Work or Collecting Data
Step 5: Preparing and Analyzing Data
Step 6: Preparing and Presenting the Report

Data Analysis - Summary
One Variable: Basic Data Analysis (summary statistics, histogram)
Q1. % who had seen a movie in the last week
Q2. Distribution of # of movies seen on TV since the end of Fall
Two Variables: Comparisons (joint distribution / pivot table, t-tests)
Q3. Who saw a movie last week? (% male, % female)
Q4. Does the average number of TV movies seen differ for men vs. women?
Q5. Is the average rating of Dramas different from the average rating of Mysteries?
Multiple Variables: Relationships (correlation, linear regression)
Q6. What predicts intention to go see "All About Steve"?

Interval or Ratio Scales: Comparing Means
Need sample size, mean, and standard deviation.
Intuition: Is 2.5 different from 3.5? (Slide shows 1-5 rating scales: with narrow spread around each mean, yes; with wide, overlapping spread, no.)

Comparing Two Means
Independent means: two different groups of people. Who saw more movies, men or women?
Dependent means: two questions asked of the same people. More movies on TV or in the theater?

Comparing Independent Means: TV Movies Seen x Gender
Does the average number of TV movies seen differ for men vs. women?
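This question sets up an independent-samples t-test. As a quick sketch of the computation (assuming Python with scipy is available, and taking the group means, variances, and sample sizes from the Excel output that follows), the same test can be run from the summary statistics alone:

```python
import math
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the Excel t-test output in these slides.
men_mean, men_var, men_n = 3.000, 8.057, 36        # TV movies seen, men
women_mean, women_var, women_n = 2.043, 5.172, 47  # TV movies seen, women

# Welch's version (equal_var=False) corresponds to Excel's
# "Two-Sample Assuming Unequal Variances".
t, p_two_tail = ttest_ind_from_stats(
    men_mean, math.sqrt(men_var), men_n,
    women_mean, math.sqrt(women_var), women_n,
    equal_var=False)
# t is about 1.66 and the two-tailed p about 0.10,
# so we fail to reject H0 at the 5% level.
```

With a two-tailed p just above .10, this matches the slide's conclusion: the gender difference is not statistically significant at the 95% confidence level.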
Men = 3.0, Women = 2.0

Independent samples => t-test:
$t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where $s_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$

Comparing Independent Means: TV Movies Seen x Gender (Excel output)

t-Test: Two-Sample Assuming Unequal Variances

                              Variable 1 (Men)   Variable 2 (Women)
Mean                          3.000              2.043
Variance                      8.057              5.172
Observations                  36                 47
Hypothesized Mean Difference  0
df                            66
t Stat                        1.657
P(T<=t) one-tail              0.051
t Critical one-tail           1.668
P(T<=t) two-tail              0.102
t Critical two-tail           1.997

Roughly a 10% chance of a result this extreme if there were really no difference (should not reject the null).

Interpretation of P-value < .05, AKA: What does .05 mean?
On average, 1 in 20 times, when you say "the averages differ for men and women," you will be wrong.
This is Type I error (alpha error): sending an innocent man up the river.
P-value = .102 for the two-tailed test: a 10.2% chance of getting a result this extreme when there is really no difference.
P-value = .051 for the one-tailed test: a 5.1% chance of getting a result this extreme when there is really no difference.
Fail to reject H0 at the 95% confidence level: "The difference by gender is not statistically significant."

Comparing Dependent Means
Liking of Drama (V32) vs. Liking of Mysteries (V33)
Q5: Is the average rating of Dramas different from the average rating of Mysteries?
Drama = 4.0, Mystery = 3.5
Calculate a "difference score" for each person, D = V33 - V32, then run a dependent-samples t-test with H0: D = 0.

Comparing Dependent Means: Drama vs.
Mystery Ratings (Excel output)

t-Test: Paired Two Sample for Means

                              Variable 1 (Drama)   Variable 2 (Mystery)
Mean                          3.964                3.542
Variance                      0.767                0.861
Observations                  83                   83
Pearson Correlation           0.355
Hypothesized Mean Difference  0
df                            82
t Stat                        3.746
P(T<=t) one-tail              0.000
t Critical one-tail           1.664
P(T<=t) two-tail              0.000
t Critical two-tail           1.989

Less than a 0.1% chance of a result this extreme if there were really no difference (should reject the null).

Correlation: A Comparable Measure of the Relationship Between Two Variables
Correlation is a ratio:

$r = \frac{\text{amount that the two variables actually do co-vary}}{\text{maximum amount that the two variables could co-vary}} = \frac{\text{covariance}}{\text{product of standard deviations}}$

It ranges from -1 to 1: 1 and -1 are perfect positive and negative correlation; 0 is no correlation at all.

(The following slides show scatterplots of Trips vs. Income in $10K.)
Perfect Correlation: perfect positive correlation = 1; perfect negative correlation = -1.
High Correlation: high positive correlation = 0.93.
Lower Correlation: lower positive correlation = 0.5; lower negative correlation = -0.5.
No Correlation: correlation = 0.

Misinterpreting No Correlation
Correlation is a measure of linear association. (Scatterplot: Trips vs. Income with correlation = 0.) Correlation = 0. No relationship?
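The ratio definition of correlation, and the caveat that it only captures linear association, can both be checked in a few lines. A minimal sketch (assuming Python with numpy; the income/trips numbers are invented purely for illustration):

```python
import numpy as np

# Hypothetical data in the spirit of the trips-vs-income scatterplots;
# these numbers are made up for illustration only.
income = np.array([1.0, 3.0, 5.0, 7.0, 9.0])   # in $10K units
trips  = np.array([2.0, 3.0, 5.0, 8.0, 9.0])

# r = covariance / (product of standard deviations)
cov = np.cov(income, trips)[0, 1]              # sample covariance (n-1)
r = cov / (income.std(ddof=1) * trips.std(ddof=1))

# numpy's built-in correlation gives the same number
assert abs(r - np.corrcoef(income, trips)[0, 1]) < 1e-12

# A perfect U-shaped relationship can still yield r = 0, which is the
# "Misinterpreting No Correlation" point above:
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
r_parabola = np.corrcoef(x, y)[0, 1]           # exactly 0 here
```

The last two lines echo the slide's warning: a correlation of zero does not mean "no relationship", only no linear relationship.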
Factors Impacting Correlations: slope and spread (panels labeled High, Lower, Lower, Lowest).

Correlations: Not a Substitute for Looking at the Data
Four very different scatterplots can share the same summary statistics: Mean = 7.5, StDev = 4.12, Correlation = .81 (Anscombe, American Statistician, 1973).

Correlations: Imperfect Measure of Relationships (Source: Wikipedia)

Regression Analysis
Objective: quantify the relationship between a criterion (dependent) variable and one or more predictor (independent) variables.
Uses:
Understanding how a predictor variable influences the dependent variable.
Predicting the dependent variable based on specified values of the predictor variables.
Forecasting how the dependent variable changes when the predictor variables change.
Examples: Sales = f(prices, promotions, ...); Satisfaction = f(service, tenure, segment, ...)

The Regression Equation (true model):
$Y = \beta_0 + \beta_1 X_1 + e$
where Y is the dependent variable (Satisfaction), $X_1$ the independent variable (Service), $\beta_0$ the constant (intercept), and $\beta_1$ the coefficient of the independent variable (slope).

Regression assumptions:
The relationship between the variables is linear.
Errors are normally distributed and uncorrelated.
The amount of error is the same at every point on the line.

How Helpful is the Regression?
Recall that the sum of squared error is a measure of inaccuracy, and that smaller is better. What % of the total sum of squares is the sum of squares associated with the regression? R-squared is the proportion of variance in the dependent variable explained by the independent variable (the correlation squared).

Application of Regression: Predicting "All About Steve"
Q6: What predicts intention to go see "All About Steve"? (V89)

Response  Answer                    #    %
5         Very Likely               1    2%
4         Likely                    2    5%
3         Undecided or Don't Know  23   52%
2         Unlikely                  7   16%
1         Very Unlikely            11   25%

Frequency of movie-viewing?
V20: On TV
V22: In a theater
V24: On rented DVD
V26: On owned DVD

What Would Be A Good Predictor? Excel: Correlation

What Would Be A Good Predictor?
Correlations: V89, V20, V22, V24, V26

                     V89     V20     V22     V24     V26
V89                  1
V20 (TV movies)      0.099   1.000
V22 (Movie theater)  0.298   0.152   1.000
V24 (Rented)         0.095  -0.026   0.127   1.000
V26 (Owned)          0.217   0.275   0.140   0.184   1.000

Potential Predictor
V22: Movies seen in a movie theater. Pearson correlation of V22 and V89 = .30. (Scatterplot annotations: "Poor fit" and an "Influential" observation.)

First Regression: V89 ~ V22

SUMMARY OUTPUT
R-squared is a measure of model fit (what % of variance in the data is explained by the model?) and equals the correlation squared: .3 x .3 = .09.

Regression Statistics
Multiple R         0.298
R Square           0.089
Adjusted R Square  0.067
Standard Error     0.964
Observations       44

ANOVA (significance test: is the variance explained > 0? Compare errors when predicting from the mean to errors from the model.)

            df    SS       MS      F       Significance F
Regression   1    3.795    3.795   4.087   0.050
Residual    42   39.000    0.929
Total       43   42.795

Coefficient table (significance test: model terms ≠ 0?)

            Coefficients   Standard Error   t Stat    P-value   Lower 95%
Intercept   2.196          0.186            11.793    0.000     1.820
V22         0.105          0.052             2.022    0.050     0.000

The regression equation is V89 = 2.196 + 0.105 V22.

Graphic representation: V22 Line Fit Plot (V89 and Predicted V89 plotted against V22).

Multiple Regression
The effect of each variable, controlling for the other variables in the model:
If predictors are uncorrelated, this is the same as the correlation.
If predictors are moderately correlated, it is useful to know.
If predictors are highly correlated, they are redundant. Multicollinearity → estimates are nonsensical; use correlations or form an index instead.

What About Other Predictors? Watching Owned Movies (V26)
Second Regression: V89 ~ V22 + V26

SUMMARY OUTPUT
R-squared is better... But the predictors are not significant and the overall F-test is weaker. Conclusion: it does not help; take V26 out.
Regression Statistics
Multiple R         0.347
R Square           0.120
Adjusted R Square  0.077
Standard Error     0.958
Observations       44

ANOVA

            df    SS       MS      F       Significance F
Regression   2    5.145    2.573   2.802   0.072
Residual    41   37.650    0.918
Total       43   42.795

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   2.100          0.201            10.428   0.000      1.694      2.507
V22         0.096          0.052             1.844   0.072     -0.009      0.201
V26         0.064          0.053             1.213   0.232     -0.043      0.172

How About Other Predictors? V37: Romantic Comedy Liking
Third Regression: V89 ~ V22 + V37

SUMMARY OUTPUT
R-squared is a lot better, and the overall F-test is better. V22 and V37 are near significant. Even controlling for attitude toward romantic comedies, there is a positive effect of seeing more movies... but what about gender?

Regression Statistics
Multiple R         0.387
R Square           0.150
Adjusted R Square  0.108
Standard Error     0.942
Observations       44

ANOVA

            df    SS       MS      F       Significance F
Regression   2    6.402    3.201   3.606   0.036
Residual    41   36.393    0.888
Total       43   42.795

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   3.148          0.585             5.385   0.000      1.968      4.329
V22         0.101          0.051             1.994   0.053     -0.001      0.203
V37        -0.253          0.148            -1.714   0.094     -0.552      0.045

Final Model: V22, V37, and Gender
Final Regression: V89 ~ V22 + V37 + V105

SUMMARY OUTPUT
Best model. Why not include V26? To avoid OVERFITTING: maximizing prediction in sample reduces prediction out of sample.
Regression Statistics
Multiple R         0.458
R Square           0.210
Adjusted R Square  0.151
Standard Error     0.919
Observations       44

ANOVA

            df    SS       MS      F       Significance F
Regression   3    8.985    2.995   3.543   0.023
Residual    40   33.810    0.845
Total       43   42.795

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   2.862          0.593             4.823   0.000      1.663      4.062
V22         0.117          0.050             2.330   0.025      0.016      0.219
V37        -0.434          0.177            -2.446   0.019     -0.792     -0.075
V105        0.606          0.347             1.748   0.088     -0.095      1.307

Cautions About Regression: Model Building
Overfitting: overly complex models fit "noise" in the data rather than generalizable patterns.
Omitted variable / mis-specification bias: the interpretation of coefficients can be wrong when key variables are left out.
Causality assumption: regression is relational; you can't assume the predictors cause the dependent variable.
And more... endogeneity, heterogeneity...

Why Model?
1. Explanatory: to understand the relationships among variables in a process.
Validating survey measures: how does intent relate to sales?
Testing hypotheses: is ad recall related to purchase intent?
Assessing relationships: which attributes are most closely related to favorability?
Variable meaning and model specification are important.
(Wyner, "Why Model?", Marketing Research, 2006)

Why Model?
2. Predictive: to make predictions based on the available data.
Identifying high-value prospects: which of these variables are predictive of responding to an offer? What is the probability that a given person will respond?
Predicting an outcome: what are the predicted sales of a new product, based on an analysis of past products?
Variable specification and causality assumptions are less of an issue ("proxy" variables are OK). Caution: overfitting and unusual cases!
(Wyner, "Why Model?", Marketing Research, 2006)

Why Model?
3. Decision Support: to quantify the relative consequences of management actions.
What if...? What if we change our pricing, media mix, promotion strategy, etc.?
Product design.
Which combination of features would yield the most attractive product?
Strong causality assumptions; the model is based on "levers" (even if their effects are small); must "control for" other factors (omitted variable bias). Applicable only in the observed range.
(Wyner, "Why Model?", Marketing Research, 2006)

Data Analysis - Summary
One Variable: Basic Data Analysis (summary statistics, histogram)
Q1. % who had seen a movie in the last week
Q2. Distribution of # of movies seen on TV since the end of Fall
Two Variables: Comparisons (joint distribution / pivot table, t-tests)
Q3. Who saw a movie last week? (% male, % female)
Q4. Does the average number of TV movies seen differ for men vs. women?
Q5. Is the average rating of Dramas different from the average rating of Mysteries?
Multiple Variables: Relationships (correlation, linear regression)
Q6. What predicts intention to go see "All About Steve"?

"David McCandless: The Beauty of Data Visualization"

Data Visualization
Edward Tufte is an American statistician and professor emeritus of political science, statistics, and computer science at Yale University. He is noted for his writings on information design and as a pioneer in the field of data visualization.

The Minard Map: "The best statistical graphic ever drawn"
Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian campaign of 1812. Beginning at the Polish-Russian border, the thick band shows the size of the army at each position. The path of Napoleon's retreat from Moscow in the bitterly cold winter is depicted by the dark lower band, which is tied to temperature and time scales.

A picture is worth a thousand words: as a statistical chart, the map unites six different sets of data.
Geography: rivers, cities, and battles are named and placed according to their occurrence on a regular map.
The army's course: the path's flow follows the route in and out that Napoleon followed.
The army's direction: indicated by the colour of the path, gold leading into Russia, black leading out of it.
The number of soldiers remaining: the path gets successively narrower, a plain reminder of the campaign's human toll, as each millimetre represents 10,000 men.
Temperature: the freezing cold of the Russian winter on the return trip is indicated at the bottom, in the republican measurement of degrees Réaumur (water freezes at 0° Réaumur and boils at 80° Réaumur).
Time: in relation to the temperature indicated at the bottom, read from right to left, starting 24 October (pluie, i.e. 'rain') and ending 7 December (-27°).