Part IV: Causality
Multivariate Regression
Chapter 11
Prof. Amine Ouazad
• Can we predict the success of a movie?
1. Avatar (2009) $760,505,847
2. Titanic (1997) $658,672,302
3. The Avengers (2012) $623,279,547
4. The Dark Knight (2008) $533,316,061
5. Star Wars: Episode I – The Phantom Menace (1999) $474,544,677
• Box_mil = First run U.S. box office (Millions of $)
• MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG.
• Budget = Production budget (Millions of $)
• Starpowr = Index of star power
• Sequel = 1 if movie is a sequel, 0 if not
• Action = 1 if action film, 0 if not
• Comedy = 1 if comedy film, 0 if not
• Animated = 1 if animated film, 0 if not
• Horror = 1 if horror film, 0 if not
• Addict = Trailer views at traileraddict.com
• Cmngsoon = Message board comments at comingsoon.net
• Fandango = Attention at fandango.com
• Cntwait3 = Percentage of Fandango votes that can't wait to see.
PART I. INTRODUCTION AND RESEARCH DESIGN Week 1
Four Steps of “Thinking Like a Statistician”
Study Design: Simple Random Sampling, Cluster Sampling, Stratified Sampling
Biases: Nonresponse bias, Response bias, Sampling bias
PART II. DESCRIBING DATA Weeks 2-4
Sample statistics: Mean, Median, SD, Variance, Percentiles, IQR, Empirical Rule
Bivariate sample statistics: Correlation, Slope
PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS Weeks 5-9
Estimating a parameter using sample statistics. Confidence Interval at 90%, 95%, 99%
Testing a hypothesis using the CI method and the t method.
PART IV. CORRELATION AND CAUSATION Weeks 10-14
Multivariate regression now!
TWO GROUPS, REGRESSION ANALYSIS
• “Comparison of Two Groups”
Last week.
• “Univariate Regression Analysis”
Last Saturday, Section 9.5.
• “Association and Causality: Multivariate Regression”
Last Saturday, Chapter 10.
Today, Tomorrow, Chapter 11.
• “Randomized Experiments and ANOVA”.
Wednesday. Chapter 12.
• “Robustness Checks and Wrap Up”.
Last Thursday.
Outline
1. Multivariate regression
2. Interpreting coefficients
Ceteris Paribus
3. Standardized Coefficient
4. Multiple Correlation and R Squared
Next time: Multivariate regression: the F test (Continued)
• y: Box = First run U.S. box office ($)
• x_1: MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG.
• x_2: Budget = Production budget ($Mil)
• x_3: Starpowr = Index of star power
• x_4: Sequel = 1 if movie is a sequel, 0 if not
• x_5: Action = 1 if action film, 0 if not
• x_6: Comedy = 1 if comedy film, 0 if not
• x_7: Animated = 1 if animated film, 0 if not
• x_8: Horror = 1 if horror film, 0 if not
• x_9: Addict = Trailer views at traileraddict.com
• x_10: Cmngsoon = Message board comments at comingsoon.net
• x_11: Fandango = Attention at fandango.com
• x_12: Cntwait3 = Percentage of Fandango votes that can't wait to see.
• With variables x_1, x_2, …, x_12.
• We are trying to get the true impact:
b_1 of variable x_1 on y.
b_2 of variable x_2 on y.
…
b_12 of variable x_12 on y.
• True model: y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + … + b_12 x_12 + e
We would get those if we had the population of all possible movies.
• Instead we estimate b_1, b_2, …, b_K on the sample (here K = 12):
– Minimizing the sum of the squared prediction errors!
• With these we can predict the success of a movie: predicted y = a + b_1 x_1 + b_2 x_2 + … + b_12 x_12, using the estimated coefficients.
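The estimation step can be sketched in a few lines. This is a minimal illustration, assuming a made-up sample with only two explanatory variables (a budget and a sequel dummy) and an exact linear relation with no error term; the helper names `solve`, `ols` and `predict` are hypothetical and the numbers are not the movie data set. It finds the coefficients that minimize the sum of squared prediction errors by solving the normal equations.

```python
# Minimal sketch: ordinary least squares via the normal equations (X'X) b = X'y.
# All data below are made up for illustration; they are not the movie data set.

def solve(A, rhs):
    """Solve the square system A z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix [A | rhs]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(rows, y):
    """Return [a, b_1, ..., b_K] minimizing the sum of squared prediction errors."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept a
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(coefs, row):
    """Predicted y = a + b_1 x_1 + ... + b_K x_K."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))

# Hypothetical sample: (budget in $ millions, sequel dummy)
rows = [(10, 0), (20, 1), (30, 0), (40, 1), (50, 0), (60, 1)]
# Box office built from a known rule so the estimates can be checked by eye
y = [5 + 0.5 * budget + 8 * sequel for budget, sequel in rows]

coefs = ols(rows, y)
print([round(c, 6) for c in coefs])        # estimated a, b_1, b_2
print(round(predict(coefs, (70, 1)), 6))   # prediction for a new hypothetical movie
```

Because the made-up outcome follows the rule exactly, the estimates should match the intercept and slopes used to build it. With real data the error term e is not zero and the estimates vary from sample to sample.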
• We only observe one coefficient estimate b_3, because we have only one sample.
• But across all possible samples, the sampling distribution of b_3 is bell-shaped.
• Hence we can design a test:
• H_0: “b_3 = 0”
Under H_0, the t statistic (the estimate b_3 divided by its standard error) follows a t distribution with N – (K + 1) degrees of freedom.
• Reject the null hypothesis at the 95% confidence level if:
– The absolute value of the t statistic is greater than the t score with N – (K + 1) degrees of freedom at 95%.
– Equivalently, if the p value is lower than 0.05.
There are as many null hypotheses as there are coefficients to estimate:
Here, there are
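As a rough sketch of the test for a single coefficient, the code below computes the t statistic, the degrees of freedom N – (K + 1), and a two-sided p value. The estimate, its standard error and the sample size N are hypothetical numbers chosen for illustration, not values from the movie regression. The p value is obtained by numerically integrating the t density, since the Python standard library has no t distribution; the function names are assumptions of this sketch.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def two_sided_p_value(t_stat, df, steps=20000):
    """P(|T| > |t_stat|) under H_0, using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t_stat| < T < |t_stat|)
    return max(0.0, 1 - central_mass)

# Hypothetical numbers, NOT taken from the movie regression output
b3 = 4.0         # estimated coefficient
se_b3 = 1.6      # its standard error
N, K = 62, 12    # assumed sample size; K = 12 explanatory variables

t_stat = b3 / se_b3
df = N - (K + 1)
p = two_sided_p_value(t_stat, df)

print(f"t = {t_stat:.3f}, df = {df}, p = {p:.4f}")
print("Reject H_0 at the 5% level" if p < 0.05 else "Do not reject H_0 at the 5% level")
```

In practice the regression output reports the t statistic and p value for each coefficient directly, so this calculation is only meant to show where those numbers come from.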
• “All other things equal”, what is the impact of variable x_3 on box office outcome in millions of $?
Increase in star power (variable x_3), all other things equal.
Keep x_1, x_2, x_4, x_5, x_6, x_7, x_8, x_9, x_10, x_11, x_12 constant! And change x_3.
• “All other things equal”, what is the impact of variable x_2 on box office outcome in millions of $?
Increase in budget (variable x_2) by 1 million $, all other things equal.
Keep x_1, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10, x_11, x_12 constant! And change x_2.
• An increase in budget by $1 million leads to a rise in box office of $0.144 million, all other things equal.
• All other things equal, an action movie has on average a lower box office outcome, by $12 million.
• An increase in the ‘Percentage of Fandango votes that can't wait to see’ (cntwait3) by 1 percentage point leads to a 0.01 * 32.15 = $0.3215 million increase in box office outcome.
We multiply by 0.01 (1%) because cntwait3 ranges from 0 to 1.
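The “all other things equal” reading can be checked with a small sketch: compare two predicted box office outcomes for movies that differ in a single variable. The three coefficients are the ones quoted above (the action coefficient is taken as -12, its rounded value). The intercept and the baseline movie are hypothetical, and they cancel out of the comparison.

```python
# Coefficients quoted in the text (box office in $ millions)
coef = {"budget": 0.144, "action": -12.0, "cntwait3": 32.15}
INTERCEPT = 0.0  # hypothetical value; it cancels when two predictions are compared

def predict(movie):
    """Predicted box office from the three variables used in this sketch."""
    return INTERCEPT + sum(coef[name] * value for name, value in movie.items())

# Hypothetical baseline movie
base = {"budget": 50.0, "action": 0, "cntwait3": 0.40}

# Change ONE variable at a time and keep everything else constant
bigger_budget = dict(base, budget=base["budget"] + 1)        # +$1 million
action_version = dict(base, action=1)                        # action instead of not
more_buzz = dict(base, cntwait3=base["cntwait3"] + 0.01)     # +1 percentage point

print(round(predict(bigger_budget) - predict(base), 4))
print(round(predict(action_version) - predict(base), 4))
print(round(predict(more_buzz) - predict(base), 4))
```

Each difference equals the coefficient times the size of the change, whatever values the baseline movie takes for the other variables.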
• x_1: MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG. ❏❏❏
• x_2: Budget = Production budget ($Mil) ❏❏❏
• x_3: Starpowr = Index of star power ❏❏❏
• x_4: Sequel = 1 if movie is a sequel, 0 if not ❏❏❏
• x_5: Action = 1 if action film, 0 if not ❏❏❏
• x_6: Comedy = 1 if comedy film, 0 if not ❏❏❏
• x_7: Animated = 1 if animated film, 0 if not ❏❏❏
• x_8: Horror = 1 if horror film, 0 if not ❏❏❏
• x_9: Addict = Trailer views at traileraddict.com ❏❏❏
• x_10: Cmngsoon = Message board comments at comingsoon.net ❏❏❏
• x_11: Fandango = Attention at fandango.com ❏❏❏
• x_12: Cntwait3 = Percentage of Fandango votes that can't wait to see. ❏❏❏
Read the p value! Or compare the t statistic to the t score with N – 13 degrees of freedom.
• Without budget among the variables, the popularity measure cntwait3 has a bigger impact than with budget included.
[Diagram: Budget, Cntwait3 and Box office (box_mil), linked by arrows]
We know that Budget and Cntwait3 are correlated (an arrow either in one direction or in the other, or both) because including Budget affects the coefficient of Cntwait3
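A small sketch can illustrate why leaving out a correlated variable changes a coefficient. The data are made up: box office is built as 0.2 × budget + 30 × cntwait3 with no error term, and budget and cntwait3 are positively correlated by construction. The helper names are hypothetical, and the two-variable slopes use the standard closed-form solution for a regression with two explanatory variables.

```python
def mean(values):
    return sum(values) / len(values)

def cross(u, v):
    """Sum of cross products of deviations from the mean."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def slope_simple(x, y):
    """Slope of the univariate regression of y on x."""
    return cross(x, y) / cross(x, x)

def slopes_two(x1, x2, y):
    """Slopes (b_1, b_2) of the regression of y on x1 and x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Hypothetical data: bigger budgets tend to come with more anticipation
budget = [10, 20, 30, 40, 50, 60]                 # $ millions
cntwait3 = [0.10, 0.25, 0.20, 0.40, 0.35, 0.50]   # share between 0 and 1
box = [0.2 * b + 30 * c for b, c in zip(budget, cntwait3)]

without_budget = slope_simple(cntwait3, box)
b_budget, with_budget = slopes_two(budget, cntwait3, box)

print(f"cntwait3 coefficient without budget: {without_budget:.2f}")
print(f"cntwait3 coefficient with budget:    {with_budget:.2f}")
```

In this constructed example the coefficient of cntwait3 is larger when budget is left out, because cntwait3 then also picks up part of the effect of the budget it is correlated with.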
We just saw:
• An increase in budget by 1 million $ leads to a rise in box office $ of 0.144 million $, all other things equal.
But is 1 million $ big? Is 0.144 million $ big?
• “A 1 standard deviation increase in x_2 leads to a … % of a standard deviation increase in y.”
• Standard deviation of x_2 (budget): 42.9.
• Standard deviation of y (box office outcome): 17.5.
• Coefficient of budget: 0.144.
• Fill in the blank.
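One way to work the blank out, sketched with the three numbers given above: the standardized coefficient is the raw coefficient multiplied by the standard deviation of x_2 and divided by the standard deviation of y. The variable names are illustrative.

```python
# Numbers from the slide
coef_budget = 0.144   # raw coefficient of budget
sd_budget = 42.9      # standard deviation of x_2 (budget)
sd_box = 17.5         # standard deviation of y (box office outcome)

# Effect of a 1 SD increase in budget, measured in SDs of y
standardized = coef_budget * sd_budget / sd_box

print(f"{standardized:.3f}")          # 0.353
print(f"{100 * standardized:.1f}%")   # 35.3%
```

So a 1 standard deviation increase in budget goes with an increase in box office of roughly 35% of a standard deviation, all other things equal.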
• How good are we at predicting the success of a movie?
• The multiple correlation is 1 if we are absolutely correct in our predictions: e_i = 0 for every movie.
• The multiple correlation is 0 if we do no better than predicting the average for every movie: e_i = y_i – (average of y).
R squared = ESS/TSS = 13356/18665 = 0.7156
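The arithmetic can be sketched as follows, taking ESS as the explained sum of squares and TSS as the total sum of squares, with the two values from the slide. The multiple correlation is the square root of the R squared; the variable names are illustrative.

```python
import math

ess = 13356   # explained sum of squares (from the slide)
tss = 18665   # total sum of squares (from the slide)

r_squared = ess / tss
multiple_r = math.sqrt(r_squared)   # multiple correlation
rss = tss - ess                     # sum of squared residuals left unexplained

print(f"R squared = {r_squared:.4f}")              # 0.7156
print(f"Multiple correlation = {multiple_r:.4f}")  # 0.8459
print(f"Residual sum of squares = {rss}")          # 5309
```

About 72% of the variation in box office outcomes is accounted for by the twelve variables.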
• We can use a number of variables to explain a dependent variable.
• Multiple regression accounts for multiple causes.
• The coefficients minimize the sum of the squared residuals.
• Understand the t test and the p value.
• The coefficients should be understood “all other things equal” or “ceteris paribus”.
• The standardized coefficients express effects in terms of standard deviations.
• The R squared, between 0 and 100%, measures the share of the variation in y that our predictions explain.
• Schedule for next week:
• Chapter on “Association and Causality”, and “Multivariate Regression”.
• Make sure you come to sessions and recitations.
Sunday: Recitation
Monday: Multivariate Regression. Evening session, 7.30pm, West Administration 002
Tuesday: Multivariate Regression, The F test. Usual class, 12.45pm, usual room
Wednesday: Randomized Experiments and ANOVA
Thursday: Wrap up
Evening session, 7.30pm, West Administration 001
Usual class, 12.45pm, usual room