Slides for Session #23

Statistics for Social and Behavioral Sciences

Part IV: Causality

Multivariate Regression

Chapter 11

Prof. Amine Ouazad

Movie Buzz

• Can we predict the success of a movie?

1. Avatar (2009): $760,505,847

2. Titanic (1997): $658,672,302

3. The Avengers (2012): $623,279,547

4. The Dark Knight (2008): $533,316,061

5. Star Wars: Episode I – The Phantom Menace (1999): $474,544,677

Data

• Box_mil = First run U.S. box office (Millions of $)

• MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG.

• Budget = Production budget (Millions of $)

• Starpowr = Index of star power

• Sequel = 1 if movie is a sequel, 0 if not

• Action = 1 if action film, 0 if not

• Comedy = 1 if comedy film, 0 if not

• Animated = 1 if animated film, 0 if not

• Horror = 1 if horror film, 0 if not

• Addict = Trailer views at traileraddict.com

• Cmngsoon = Message board comments at comingsoon.net

• Fandango = Attention at fandango.com

• Cntwait3 = Percentage of Fandango votes that can't wait to see.

Statistics Course Outline

PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)

Four Steps of “Thinking Like a Statistician”

Study Design: Simple Random Sampling, Cluster Sampling, Stratified Sampling

Biases: Nonresponse bias, Response bias, Sampling bias

PART II. DESCRIBING DATA (Weeks 2-4)

Sample statistics: Mean, Median, SD, Variance, Percentiles, IQR, Empirical Rule

Bivariate sample statistics: Correlation, Slope

PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)

Estimating a parameter using sample statistics. Confidence Interval at 90%, 95%, 99%

Testing a hypothesis using the CI method and the t method.

PART IV. CORRELATION AND CAUSATION: TWO GROUPS, REGRESSION ANALYSIS (Weeks 10-14)

Multivariate regression now!

Coming up

• “Comparison of Two Groups”

Last week.

• “Univariate Regression Analysis”

Last Saturday, Section 9.5.

• “Association and Causality: Multivariate Regression ”

Last Saturday, Chapter 10.

Today, Tomorrow, Chapter 11.

• “Randomized Experiments and ANOVA”.

Wednesday. Chapter 12.

• “Robustness Checks and Wrap Up”.

Last Thursday.

Outline

1. Multivariate regression

2. Interpreting coefficients

Ceteris Paribus

3. Standardized Coefficient

4. Multiple Correlation and R Squared

Next time: Multivariate regression: the F test (Continued)

Data: Variables

• y: Box = First run U.S. box office ($)

• x_1: MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG.

• x_2: Budget = Production budget ($Mil)

• x_3: Starpowr = Index of star power

• x_4: Sequel = 1 if movie is a sequel, 0 if not

• x_5: Action = 1 if action film, 0 if not

• x_6: Comedy = 1 if comedy film, 0 if not

• x_7: Animated = 1 if animated film, 0 if not

• x_8: Horror = 1 if horror film, 0 if not

• x_9: Addict = Trailer views at traileraddict.com

• x_10: Cmngsoon = Message board comments at comingsoon.net

• x_11: Fandango = Attention at fandango.com

• x_12: Cntwait3 = Percentage of Fandango votes that can't wait to see.

Multivariate Regression

• With variables x_1, x_2, …, x_12.

• We are trying to get the true impact:

 b_1 of variable x_1 on y.

 b_2 of variable x_2 on y.

 …

 b_12 of variable x_12 on y.

• True model: y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + … + b_12 x_12 + e

We would get those if we had the population of all possible movies.

Multivariate Regression

• Instead we estimate b_1, b_2, …, b_K on the sample:

– Minimizing the sum of the squared prediction errors!

• With these estimates we can predict the success of a movie: predicted y = a + b_1 x_1 + b_2 x_2 + … + b_K x_K
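The estimation and prediction steps can be sketched in a few lines of Python. This is only an illustration: the data are simulated, the three variables and the "true" coefficients used to generate them are assumptions made for the example, and NumPy's least-squares routine stands in for whatever statistical package is used in class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 simulated movies, 3 explanatory variables
# (this is NOT the course data set).
n = 200
budget = rng.uniform(5, 150, size=n)      # production budget, millions of $
starpower = rng.uniform(0, 30, size=n)    # index of star power
sequel = rng.integers(0, 2, size=n)       # 1 if sequel, 0 if not

# Assumed "true" model, used only to simulate y; e is the error term.
e = rng.normal(0, 10, size=n)
box = 5.0 + 0.15 * budget + 0.5 * starpower + 12.0 * sequel + e

# Design matrix with a leading column of ones for the intercept a.
X = np.column_stack([np.ones(n), budget, starpower, sequel])

# Least squares: pick (a, b_1, b_2, b_3) that minimize the sum of
# squared prediction errors on the sample.
coef, *_ = np.linalg.lstsq(X, box, rcond=None)
a_hat, b_hat = coef[0], coef[1:]

# Predict the box office of a new movie: budget 80, star power 20, a sequel.
new_movie = np.array([1.0, 80.0, 20.0, 1.0])
predicted_box = new_movie @ coef

print("intercept:", a_hat)
print("slopes:", b_hat)
print("predicted box office (millions of $):", predicted_box)
```

The estimated slopes should land close to, but not exactly on, the values used in the simulation, because we only see one sample.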

Sampling Distribution of b_3

• We only observe one coefficient estimate b_3, because we have only one sample.

• But across all possible samples, the sampling distribution of b_3 is bell-shaped.

• Hence we can design a test:

• H_0: “b_3 = 0”

Under H_0, the t statistic (b_3 divided by its standard error) follows a t distribution with N – (K + 1) degrees of freedom.

Hypothesis testing for H_0: “b_3 = 0”

• Reject the null hypothesis at 95% if:

– The absolute value of the t statistic is greater than the t score with N – (K + 1) degrees of freedom at 95%.

– Equivalently, if the p value is lower than 0.05.

There are as many null hypotheses as there are coefficients to estimate:

Here, there are
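The decision rule can be sketched as follows. The sample is simulated (N = 120 movies and K = 3 variables are arbitrary choices for the example, not the course data), and the standard errors use the usual least-squares formula; `scipy.stats.t` supplies the t score and the p values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical sample: N movies, K = 3 explanatory variables.
N, K = 120, 3
X_vars = rng.normal(size=(N, K))
# Simulated outcome: x_1 and x_2 matter, x_3 has a true coefficient of 0.
y = 2.0 + 1.5 * X_vars[:, 0] - 1.0 * X_vars[:, 1] + rng.normal(size=N)

X = np.column_stack([np.ones(N), X_vars])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual variance uses N - (K + 1) degrees of freedom.
df = N - (K + 1)
residuals = y - X @ coef
sigma2 = residuals @ residuals / df

# Standard errors: square roots of the diagonal of sigma2 * (X'X)^(-1).
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

t_stats = coef / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df)  # two-sided p values
t_crit = stats.t.ppf(0.975, df)                 # t score at 95%

for name, t, p in zip(["a", "b_1", "b_2", "b_3"], t_stats, p_values):
    decision = "reject H_0" if abs(t) > t_crit else "do not reject H_0"
    print(f"{name}: t = {t:.2f}, p = {p:.4f} -> {decision}")
```

Comparing |t| to the t score and comparing the p value to 0.05 always give the same decision; the p value is simply easier to read off a regression table.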

Ceteris Paribus

= “All other things equal”

“All other things equal”, what is the impact of variable x_3 on box office outcome in millions of $?

Increase in x_3 (Star power)

Increase in star power (variable x_3), all other things equal.

Keep x_1, x_2, x_4, x_5, x_6, x_7, x_8, x_9, x_10, x_11, x_12 constant! And change x_3.

Ceteris Paribus

= “All other things equal”

“All other things equal”, what is the impact of variable x_2 on box office outcome in millions of $?

Increase in x_2 (Budget) by 1 million $

Increase in budget (variable x_2), all other things equal.

Keep x_1, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10, x_11, x_12 constant! And change x_2.

Reading the coefficients

• An increase in budget by 1 million $ leads to a rise in box office $ of 0.144 million $, all other things equal.

• An action movie has on average all other things equal a lower box office outcome, by $12 million.

• An increase in the ‘Percentage of Fandango votes that can't wait to see’ (cntwait3) by 1 percentage point leads to a 0.01 * 32.15 = 0.3215 million $ increase in box office outcome, all other things equal.

We multiply by 0.01 (1%) because cntwait3 ranges from 0 to 1.
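The arithmetic behind these three readings is the same each time: coefficient times the change in that one variable, with every other variable held fixed. A minimal sketch, using the coefficients quoted above (the value -12.0 for Action is the rounded figure from the slide, and the helper name `change_in_box_office` is made up for this example):

```python
# Coefficients quoted on the slide (box office measured in millions of $).
# -12.0 for "action" is the rounded figure from the slide.
coef = {"budget": 0.144, "action": -12.0, "cntwait3": 32.15}

def change_in_box_office(variable, change):
    """Predicted change in box office (millions of $) when one variable
    moves by `change` and all other variables are held constant."""
    return coef[variable] * change

print(change_in_box_office("budget", 1))       # budget up by 1 million $
print(change_in_box_office("action", 1))       # action film (1) versus not (0)
print(change_in_box_office("cntwait3", 0.01))  # 1 percentage point on a 0-to-1 scale
```

The units of the change matter: cntwait3 is stored as a share between 0 and 1, so one percentage point is a change of 0.01, not 1.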

Which coefficients are statistically significant?

• x_1: MPRating = 1 if movie is PG13 or R, 0 if the movie is G or PG. ❏❏❏

• x_2: Budget = Production budget ($Mil) ❏❏❏

• x_3: Starpowr = Index of star power ❏❏❏

• x_4: Sequel = 1 if movie is a sequel, 0 if not ❏❏❏

• x_5: Action = 1 if action film, 0 if not ❏❏❏

• x_6: Comedy = 1 if comedy film, 0 if not ❏❏❏

• x_7: Animated = 1 if animated film, 0 if not ❏❏❏

• x_8: Horror = 1 if horror film, 0 if not ❏❏❏

• x_9: Addict = Trailer views at traileraddict.com ❏❏❏

• x_10: Cmngsoon = Message board comments at comingsoon.net ❏❏❏

• x_11: Fandango = Attention at fandango.com ❏❏❏

• x_12: Cntwait3 = Percentage of Fandango votes that can't wait to see. ❏❏❏

Read the p value! Or compare the t statistic to the t score with N – 13 degrees of freedom.

[Regression output: with Budget and without Budget]

Budget and Can’t Wait to See the movie!

• Without budget among the variables, the popularity measure cntwait3 has a bigger impact than with budget included.

[Diagram: Budget, Cntwait3 and Box office (box_mil), linked by arrows]

We know that Budget and Cntwait3 are correlated (an arrow either in one direction or in the other, or both) because including Budget affects the coefficient of Cntwait3.
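A small simulation shows the same pattern. Everything here is assumed for illustration: the data are simulated so that bigger-budget movies also attract more anticipation, and the helper `ols` is a made-up name wrapping NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical data: budget and cntwait3 are positively correlated,
# and both raise box office.
budget = rng.uniform(5, 150, size=n)
cntwait3 = np.clip(0.2 + 0.003 * budget + rng.normal(0, 0.1, size=n), 0, 1)
box = 10 + 0.15 * budget + 30 * cntwait3 + rng.normal(0, 10, size=n)

def ols(y, *columns):
    """Least-squares coefficients, intercept first, then one per column."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

with_budget = ols(box, budget, cntwait3)   # [a, b_budget, b_cntwait3]
without_budget = ols(box, cntwait3)        # [a, b_cntwait3]

print("cntwait3 coefficient with budget:   ", with_budget[2])
print("cntwait3 coefficient without budget:", without_budget[1])
```

When budget is left out, cntwait3 picks up part of the budget effect, so its coefficient is larger. If the two variables were uncorrelated, dropping budget would leave the cntwait3 coefficient roughly unchanged.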

Standardized Coefficient

We just saw:

• An increase in budget by 1 million $ leads to a rise in box office $ of 0.144 million $, all other things equal.

But is 1 million $ big? Is 0.144 million $ big?

Standardized Coefficient

• “A 1 standard deviation increase in x_2 leads to a …. % standard deviation increase in y.”

• Standard deviation of x_2 (budget): 42.9.

• Standard deviation of y (box office outcome): 17.5.

• Coefficient of budget: 0.144.

• Fill in the blank.
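One way to work this out: a 1 standard deviation increase in x_2 is an increase of 42.9, which moves y by 0.144 * 42.9; dividing by the standard deviation of y expresses that move in standard deviations of y. A short sketch with the numbers from the slide (the function name is made up for the example):

```python
def standardized_coefficient(b, sd_x, sd_y):
    """Effect on y, in standard deviations of y, of a 1 standard
    deviation increase in x, all other things equal."""
    return b * sd_x / sd_y

# Numbers from the slide: coefficient of budget and the two standard deviations.
beta_budget = standardized_coefficient(b=0.144, sd_x=42.9, sd_y=17.5)

print(f"{beta_budget:.3f} standard deviations of y, "
      f"i.e. {100 * beta_budget:.1f}% of one standard deviation")
```

Because the result has no units, standardized coefficients of different variables can be compared with each other, which raw coefficients do not allow.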

Standardized Coefficient

• An increase in budget by 1 million $ leads to a rise in box office $ of 0.144 million $, all other things equal.

• An action movie has on average all other things equal a lower box office outcome, by $12 million.

• An increase in the ‘Percentage of Fandango votes that can't wait to see’ (cntwait3) by 1 percentage point leads to a 0.01 * 32.15 = 0.3215 million $ increase in box office outcome, all other things equal.

We multiply by 0.01 (1%) because cntwait3 ranges from 0 to 1.

R Squared

• How good are we at predicting the success of a movie?

• The multiple correlation is 1 if we are absolutely correct in our predictions: e_i = 0 for every movie.

• The multiple correlation is 0 if we do no better than taking the average: e_i = y_i – ȳ for every movie.

R squared = ESS/TSS = 13356/18665 = 0.7156
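The two extreme cases and the number above can be checked with a short sketch. The four box office values are invented for the example; the function uses the form 1 - RSS/TSS, which equals ESS/TSS for a least-squares regression that includes an intercept.

```python
import math

def r_squared(y, y_hat):
    """R squared = 1 - (sum of squared residuals) / (total sum of squares)."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - rss / tss

# Hypothetical box office outcomes for four movies (millions of $).
y = [10.0, 20.0, 30.0, 40.0]

print(r_squared(y, y))            # perfect predictions, e_i = 0 for every movie
print(r_squared(y, [25.0] * 4))   # always predicting the average of y

# Numbers from the slide: explained and total sums of squares.
r2 = 13356 / 18665
print(round(r2, 4), "multiple correlation:", round(math.sqrt(r2), 3))
```

The multiple correlation is the square root of R squared, so it is also between 0 and 1.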

Wrap up

• We can use a number of variables to explain a dependent variable.

• Multiple regression accounts for multiple causes.

• The coefficients minimize the sum of the squared residuals.

• Understand the t test and the p value.

• The coefficients should be understood “all other things equal” or “ceteris paribus”.

• The standardized coefficients express effects in terms of standard deviations.

• The R squared, between 0 and 100%, measures how accurate our predictions are.

Coming up:

• Schedule for next week:

• Chapter on “Association and Causality”, and “Multivariate Regression”.

• Make sure you come to sessions and recitations.

• Sunday: Recitation

• Monday: Multivariate Regression. Evening session, 7.30pm, West Administration 002

• Tuesday: Multivariate Regression, The F test. Usual class, 12.45pm, usual room

• Wednesday: Randomized Experiments and ANOVA

• Thursday: Wrap up. Evening session, 7.30pm, West Administration 001

Usual class, 12.45pm, usual room
