Slides for Session #6: Correlation, the regression line

Statistics for Social
and Behavioral Sciences
Session #6: The Regression Line (Cont'd)
(Agresti and Finlay, Chapter 9)
Prof. Amine Ouazad
Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)
PART II. DESCRIBING DATA (Weeks 2-4)
Where we are right now! Describing associations between two variables.
PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)
Firenze or Lebanese Express?
PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14)
This is where we talk about Zmapp and Ebola!
Last Session
• From a scatter plot to a linear relationship
– A linear relationship is a model, imperfect.
– A linear relationship implies constant gradients.
– A linear relationship helps predict/extrapolate,
interpolate to fill missing statistics.
• Finding the regression line
– The regression line minimizes the sum of squared
errors.
– The formulas for a and b are essential to learn.
Outline
1. The Regression Line (Cont'd)
– Last time’s recap
– Why we call it regression
2. Warning: Correlation is not causation
– Spurious relationships
– Being agnostic about causality: correlation
3. How well does the linear model perform?
Next session:
Bivariate analysis
Chapter 9 of A&F, continued
Finding the regression line
• Which line is the right one?
• A line is entirely determined by the choice of
a and b.
An essential formula:
$$b = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
x is the explanatory variable.
y is the response variable.
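Notice the difference between β and b, and between α and a (Greek letters for the parameters of the model, Latin letters for the values computed from the data).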
If y increases when x increases, then b > 0.
If y decreases when x increases, then b < 0.
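The formulas for a and b can be tried out on a small data set. The sketch below is only an illustration: the function name `regression_line` and the five data points are made up for the example and do not come from the course data.

```python
def regression_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator of b: sum of cross-products of deviations from the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator of b: sum of squared deviations of x
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = regression_line(x, y)
print(f"a = {a:.2f}, b = {b:.2f}")  # y tends to increase with x, so b > 0
```

Here the means are 3 and 4, the numerator is 6 and the denominator is 10, so b = 0.6 and a = 4 - 0.6 × 3 = 2.2.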
Why do we call this regression?
Sir F. Galton
• “Regression towards mediocrity in Hereditary Stature”,
Sir Francis Galton, 1886.
What are y, x, and b here?
Understanding Galton: Questions
• A little exercise to understand Sir Francis:
1. What is the data? How many observations? What is
y? What is x?
2. Write the assumed linear relationship
between y and x.
3. Can you express the mean of y?
(as a function of the mean of x)
4. Take the difference between child i’s height and
children’s mean height.
5. How does it relate to the difference between child i’s
parents’ midheight and the mean of parents’
midheight?
I use mean and average interchangeably in this course: they have the same formula.
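One possible way through questions 2 to 5 of the Galton exercise is sketched below, using the fitted line. The notation is an assumption made for this sketch: $x_i$ is the parents' midheight, $\hat{y}_i$ is the predicted height of child i, and b is left symbolic.

```latex
% Question 2: the assumed linear relationship (fitted line)
\hat{y}_i = a + b\,x_i

% Question 3: since a = \bar{y} - b\,\bar{x}, the line passes through the means
\bar{y} = a + b\,\bar{x}

% Questions 4 and 5: subtract the second equation from the first
\hat{y}_i - \bar{y} = b\,(x_i - \bar{x})
```

When 0 < b < 1, a child's predicted height is closer to the children's mean than the parents' midheight is to the parents' mean. That shrinking of the deviation is the "regression" in Galton's title.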
“More than a fifth of people on
unemployment benefits have a criminal
record, government figures have
revealed.
The new data showed an estimated 22
per cent of all people claiming out of
work claimants - such as Jobseeker’s
Allowance - were made by people who
had been to prison or convicted of an
offence in the previous 12 years.”
Chris Grayling, the Justice Secretary, is
pushing through reforms which aim to
provide more support to offenders who
are released from jail back into the
community.
Jeremy Wright, the justice minister,
said: “We are committed to delivering
long-needed changes that will see all
offenders released from prison receive
targeted support to finally turn
themselves around and start
contributing to society.”
Unemployment and Crime
“The figures also showed 44 per cent of offenders were
claiming benefits a month after being convicted, cautioned or
released from jail.”
“More than half of offenders - 54 per cent - released from
prison were claiming out-of-work benefits one month later,
gradually decreasing to 42 per cent two years after.”
“In all, 214,000 people claiming out-of-work benefits had been
to prison at least once in the previous 12 years, or 4 per cent
of the total.”
“Previous data published in 2011 estimated the proportion of
criminal claimants was slightly higher, at 26 per cent, but a
Ministry of Justice spokesman said the sets of figures were not
directly comparable.”
Chris Grayling,
Justice Secretary (UK)
Association is not causation
What Drives Obesity?
Is higher obesity due to the rise in driving?
Perhaps. It’s an intriguing hypothesis. But our
friends at The Economist should know better
than to report nonsensical correlations. Here’s
the evidence they cite (drawn from this entirely
unconvincing research paper published in
Transport Policy):
Looks impressive, right? (Well, apart from
putting the explanatory variable on the vertical
axis.) But before concluding that there’s
anything here, let’s try a different variable,
instead—my age:
Reading is an important skill, and elementary school
teachers have observed that the reading ability of their
students tends to increase with their shoe size. To help
boost reading skills, should policymakers offer prizes to
scientists to devise methods to increase the shoe size of
elementary school children? Obviously, the tendency for
shoe size and reading ability to increase together does not
mean that big feet cause improvements in reading skills.
Older children have bigger feet, but they also have more
developed brains. This natural development of children
explains the simple observation that shoe size and reading
ability have a tendency to increase together—that is, they
are positively correlated. But clearly there is no causal relationship:
bigger shoe size does not cause better reading ability.
In economics, correlations are common. But identifying
whether the correlation between two or more variables
represents a causal relationship is rarely so easy. Countries
that trade more with the rest of the world also have higher
income levels—but does this mean that trade raises income
levels? People with more education tend to have higher
earnings, but does this imply that education results in higher
earnings? Knowing precise answers to these questions is
important. If additional years of schooling caused higher
earnings, then policymakers could reduce poverty by
providing more funding for education. If an extra year of
education resulted in a $20,000 a year increase in earnings,
then the benefits of spending on education would be a lot
larger than if an extra year of education caused only a $2 a
year increase.
Economists need
statisticians
Association is not causation
• The response variable may be the explanatory variable and vice
versa (reverse causation).
• There may be other factors that affect the response variable, other
than the explanatory variable.
☞ Multivariate statistics coming up in week 12.
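The "other factors" point can be illustrated with a small simulation of the shoe size and reading example. Everything here is an assumption made for the sketch: the data are simulated, the coefficients are arbitrary, and the helper names `correlation` and `residuals` are hypothetical. Age drives both variables, and neither variable affects the other.

```python
import random
from statistics import mean, pstdev


def correlation(x, y):
    """Correlation of x and y, using 1/N in the covariance and the standard deviations."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


def residuals(x, y):
    """Errors left over after fitting the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


rng = random.Random(0)  # fixed seed so the simulation is reproducible

# Simulated children: age is the common cause (all numbers are made up)
age = [rng.uniform(6, 11) for _ in range(5000)]
shoe_size = [0.5 * a + rng.gauss(0, 1) for a in age]
reading = [10 * a + rng.gauss(0, 10) for a in age]

# Raw association: clearly positive, although shoe size has no effect on reading
r_raw = correlation(shoe_size, reading)

# Association left once the part explained by age is removed from both variables
r_given_age = correlation(residuals(age, shoe_size), residuals(age, reading))

print(f"raw correlation:           {r_raw:.2f}")
print(f"correlation net of age:    {r_given_age:.2f}")
```

The raw correlation is clearly positive, while the correlation net of age is close to zero. This is the kind of question that multivariate statistics is designed to handle.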
• Univariate statistics (Weeks 1 and 2)
– Inspecting the distribution of one variable.
– Am I taller than the average? Than the median?
– What percentile of the distribution do I belong to?
• Bivariate statistics (Now and next week)
– Discovering associations between 2 variables.
– What is the relationship between parents’ height and children’s height?
– What is the relationship between unemployment and crime?
• Multivariate statistics (Week 12)
– Uncovering causality: looking at the impact of multiple explanatory variables on one response variable.
– What factors cause crime? Poverty, unemployment, guns, police headcounts?
The correlation of two variables
• The correlation of two variables is:
$$r(x, y) = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{s_x\, s_y}$$
A sum of N observations:
fortunately a computer will
usually do it (Stata)
• The correlation does not make an assumption
about the direction of causality (the slope does).
• It is, however, related to the slope:
$$r(x, y) = \left(\frac{s_x}{s_y}\right) b$$
where r(x, y) is the correlation, s_x is the standard deviation of x, s_y is the standard deviation of y, and b is the slope.
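Both ways of writing the correlation can be checked against each other numerically. The sketch below is an illustration on made-up data. The helper names are hypothetical, and the standard deviations use 1/N (`statistics.pstdev`) so that they match the 1/N in the covariance.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """r(x, y): average cross-product of deviations, divided by s_x * s_y."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


def slope(x, y):
    """Slope b of the regression line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    return (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(x, y)
r_from_slope = (pstdev(x) / pstdev(y)) * slope(x, y)

print(f"r(x, y)       = {r:.4f}")
print(f"(s_x / s_y) b = {r_from_slope:.4f}")
print(f"r(y, x)       = {correlation(y, x):.4f}")  # same value: r is symmetric
```

Swapping x and y leaves r unchanged but changes the slope, which is one way to see that the correlation takes no stand on which variable is the explanatory one.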
An Example:
Unemployment and Murders – The Sequel
• Standard deviation of Unemployed Persons: 5,901.259
• Standard deviation of Murders: 20.44
• Regression line: we find b = 0.00285 and a = -1.96
• The correlation r(Unemployed, Murders) is: 0.83.
Self-check?
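One way to do the self-check is to plug the reported figures into the relation between the correlation and the slope. The sketch below uses only the numbers shown on the slide. The variable names are made up, and the small gap with the reported 0.83 is presumably due to b being rounded to three significant digits.

```python
# Figures reported on the slide
sd_unemployed = 5901.259   # standard deviation of x (Unemployed Persons)
sd_murders = 20.44         # standard deviation of y (Murders)
b = 0.00285                # slope of the regression line (rounded)

# r(x, y) = (s_x / s_y) * b
r_check = (sd_unemployed / sd_murders) * b
print(round(r_check, 3))
```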
Properties of the correlation
• The correlation is a number between -1 and 1,
sometimes (but rarely) expressed as a percentage.
• If two variables have a correlation of 1, we say
that they are perfectly correlated…
– Example: student expenses in USD are perfectly
correlated with student expenses in AED.
– y is exactly a+b x, with b>0.
• If two variables have a correlation of -1, the two
variables are exactly such that y = a + b x, with
b<0.
– Example: Number of days to New Year’s eve, Number
of days from New Year’s eve.
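The two limiting cases can be reproduced with a few made-up numbers. In this sketch the expense amounts and day counts are hypothetical. The exchange rate of 3.6725 AED per USD and the 365-day year are assumptions used only to build exact linear relationships.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Correlation of x and y, with 1/N in the covariance and standard deviations."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


# Student expenses in USD (hypothetical) and the same expenses in AED
AED_PER_USD = 3.6725  # assumed fixed exchange rate
expenses_usd = [100.0, 250.0, 400.0, 820.0]
expenses_aed = [AED_PER_USD * v for v in expenses_usd]  # y = a + b x with a = 0, b > 0

# Days to New Year's Eve and days since the previous one (hypothetical dates)
days_to = [10, 45, 120, 300]
days_from = [365 - d for d in days_to]                   # y = a + b x with b = -1

print(correlation(expenses_usd, expenses_aed))  # equal to 1, up to rounding error
print(correlation(days_to, days_from))          # equal to -1, up to rounding error
```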
How “good” are our predictions?
Ouch: we make errors.
The actual $y_i$, and the predicted $y_i$, noted $\hat{y}_i$.
The regression line minimizes the sum of the squared errors:
$$\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Remember the formula for b and a.
When does a model predict y perfectly?
When does the model have no predictive power?
[Figure: scatter plot of y against x, with the regression line.]
Playing with the R Squared
• The R Squared is: $$R^2 = \frac{\text{variance}(\hat{y})}{\text{variance}(y)}$$
• Answers the question(s):
– “What fraction of the variance of the response
variable is explained by the explanatory variable?”
– “What percentage of the variance of the response
variable is explained by the explanatory variable?”
• Measures the fit of the linear model.
• The R squared is also the square of the
correlation between x and y: $R^2 = r^2$
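The claim that the R Squared equals the squared correlation can be checked on a small example. This is a sketch under stated assumptions: the data are made up, the R Squared is computed as the variance of the predictions divided by the variance of y, and all variances use 1/N (`statistics.pvariance`).

```python
from statistics import mean, pstdev, pvariance

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Fit the regression line y = a + b x
x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

predictions = [a + b * xi for xi in x]
errors = [yi - pi for yi, pi in zip(y, predictions)]

# Fraction of the variance of y explained by x
r_squared = pvariance(predictions) / pvariance(y)

# Correlation between x and y
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)
r = cov / (pstdev(x) * pstdev(y))

print(f"R squared = {r_squared:.3f}")
print(f"r squared = {r ** 2:.3f}")
print(f"variance(y) = {pvariance(y):.3f}")
print(f"variance(prediction) + variance(error) = "
      f"{pvariance(predictions) + pvariance(errors):.3f}")
```

A model that predicts y perfectly leaves no errors, so its R Squared is 1. A model with no predictive power has a flat line (b = 0), so its predictions have zero variance and its R Squared is 0.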
An Example:
Unemployment and Murders – The Sequel
• The variance of the predicted number of murders is: 284.3
• The variance of the actual number of murders is: 417.8
• The R Squared is: 284.3 / 417.8 ≈ 0.68
Not bad!
• Side question: what is the variance of the errors (residuals)?
Remember: variance(y) = variance(prediction) + variance(error)
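The side question can be worked out directly from the two variances on the slide. This is a sketch of the arithmetic only. The variable names are made up, and the last line compares the result with the correlation of 0.83 reported earlier for the same example.

```python
# Figures reported on the slide
var_predicted = 284.3   # variance of the predicted number of murders
var_actual = 417.8      # variance of the actual number of murders

# variance(y) = variance(prediction) + variance(error)
var_errors = var_actual - var_predicted

# R Squared: fraction of the variance of y explained by the model
r_squared = var_predicted / var_actual

# Since R Squared is the square of the correlation, its square root
# should be close to the reported correlation
implied_r = r_squared ** 0.5

print(f"variance of the errors = {var_errors:.1f}")
print(f"R squared = {r_squared:.3f}")
print(f"implied correlation = {implied_r:.3f}")
```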
Follow my lead, it’s easier
Wrap up
• Finding the regression line (Sir Galton)
– The regression line minimizes the sum of squared errors.
– The formulas for a and b are essential.
• Association is not causation
– Does x cause y or does y cause x?
– Is there any other factor that may cause y?
– Being agnostic about the direction of causality: the correlation r.
• How good are my predictions? How good is my model?
– Use the R Squared, know its formula.
– The variance is the square of the standard deviation.
Next session: Minority Report continues
Don’t forget:
• Midterm 1 coming up in week 5 (exact date coming soon from the Registrar Mary Downes).
• Online Quiz #3 starting tonight at 9pm, due Tuesday at 9am.
• Sunday recitation on: “The Regression Line: ‘Education and Economic Growth.’”
• In chapter 9, read everything except Section 9.5 (Inferences for the Slope).
For help:
• Amine Ouazad
Office 1135, Social Science building
amine.ouazad@nyu.edu
Office hour: Wednesday from 4 to 5pm.
• GAF: Irene Paneda
Irene.paneda@nyu.edu
Sunday recitations.
At the Academic Resource Center, Monday from 2 to 4pm.