Slides for Session #9

Statistics for Social
and Behavioral Sciences
Session #9: Linear Regression and
Conditional distribution
Probabilities
(Agresti and Finlay, Chapter 9)
Prof. Amine Ouazad
Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)
PART II. DESCRIBING DATA (Weeks 2-4)
– Describing associations between two variables.
– Where we are right now!
PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)
– Firenze or Lebanese Express?
PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14)
– This is where we talk about Zmapp and Ebola!
Last session
• How good are my predictions? How good is my
model?
– Use the R Squared = ESS/TSS.
– TSS = ESS + SSE.
– The notations TSS, ESS, SSE are widespread.
– The variance is the square of the standard deviation.
– The R squared is also the square of the correlation of
the predicted value and the actual value.
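The identities above can be checked numerically. The following is a minimal sketch on a small made-up data set (the numbers are hypothetical, not the course data), assuming the usual least-squares line with an intercept.

```python
import math

# Hypothetical toy data (NOT the Census data): schooling in years, income in dollars.
x = [10, 12, 12, 14, 16, 16, 18]
y = [20000, 26000, 30000, 33000, 41000, 45000, 52000]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope b and intercept a.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x
predicted = [a + b * xi for xi in x]

# Total, explained, and residual sums of squares.
tss = sum((yi - mean_y) ** 2 for yi in y)
ess = sum((pi - mean_y) ** 2 for pi in predicted)
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
r_squared = ess / tss


def correlation(u, v):
    """Pearson correlation of two equally long lists."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    var_u = sum((ui - mean_u) ** 2 for ui in u)
    var_v = sum((vi - mean_v) ** 2 for vi in v)
    return cov / math.sqrt(var_u * var_v)


print("TSS       =", tss)
print("ESS + SSE =", ess + sse)
print("R squared =", r_squared)
print("corr(predicted, actual)^2 =", correlation(predicted, y) ** 2)
```

The two pairs of printed values should agree up to rounding error.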
Outline
1. Conditional distribution
– What wage will I earn after graduation?
2. Probabilities (Chapter 4)
After the Break:
Probability Distributions
Chapter 4 of A&F
WHEN LaTisha Styles graduated from Kennesaw State
University in Georgia in 2006 she had $35,000 of student debt.
This obligation would have been easy to discharge if her
Spanish degree had helped her land a well-paid job. But there
is no shortage of Spanish-speakers in a nation that borders
Latin America. So Ms Styles found herself working in a clothes
shop and a fast-food restaurant for no more than $11 an hour.
Frustrated, she took the gutsy decision to go back to the same
college and study something more pragmatic. She majored in
finance, and now has a good job at an investment consulting
firm. Her debt has swollen to $65,000, but she will have little
trouble paying it off.
A Contingency Table
(From Previous Session)
We will learn how to produce this later in the course. For now, let’s interpret/understand this.
Shows the average weekly
earnings for each year of
education.
• But can I do a regression analysis here?
What wage will I earn after
graduation?
• Data: Census of Population 2010.
• The United States Census is a decennial census
mandated by Article I, Section 2 of the United States
Constitution, which states: "Representatives and direct
Taxes shall be apportioned among the several States ...
according to their respective Numbers ... . The actual
Enumeration shall be made within three Years after the
first Meeting of the Congress of the United States, and
within every subsequent Term of ten Years.”
• Variables:
– Number of years of education completed.
– Wage income.
• We can only perform regression analysis on
quantitative variables.
Linear Relationship anybody?
• We can postulate that there is a linear relationship
between wage income(y) and years of schooling (x).
y = α + βx + ε
• Using Greek letters here: this is the true relationship.
• Notice the importance of residuals (aka errors)
• Units of measurement matter. Make sure you read the
fine print.
– y is annual income in dollars. x is in years.
• Also, with a linear relationship, an additional year of
education leads to the same increase in income at any
stage of your education process.
– Makes sense? Check the contingency table.
• We keep the linear relationship as a convenient model.
Estimation of α and β
• We estimate the true parameters α and β (the intercept and the slope) by computing their sample values a and b. We
only have a sample, not the entire population.
• So, what earnings can we expect?
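To illustrate the difference between the true parameters and their sample estimates, here is a minimal simulation sketch. The "true" values `ALPHA`, `BETA` and `ERROR_SD` are hypothetical, chosen only for illustration, and `fit_line` assumes the usual least-squares formulas for the slope and intercept.

```python
import random

random.seed(2010)  # fixed seed so the sketch is reproducible

# Hypothetical "true" population parameters (assumptions, not course values).
ALPHA, BETA, ERROR_SD = -100.0, 2700.0, 18000.0


def draw_sample(n):
    """Draw n individuals: years of schooling x and income y = ALPHA + BETA*x + error."""
    xs = [random.randint(8, 20) for _ in range(n)]
    ys = [ALPHA + BETA * x + random.gauss(0, ERROR_SD) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares estimates (a, b) of the intercept and slope."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


xs, ys = draw_sample(5000)
a, b = fit_line(xs, ys)
print("true slope:", BETA, " estimated slope b:", round(b, 1))
print("true intercept:", ALPHA, " estimated intercept a:", round(a, 1))
```

The estimates a and b differ from the true values because they come from one sample; a different sample would give slightly different numbers.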
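Linear? A contingency table
[Figure: regression line of wage income (y) against years of schooling, with intercept a and slope b (the change in y for 1 additional year of schooling).]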
(… Continued …)
• Here, years of schooling (x) is quantitative discrete, so we can do
both regression analysis and a contingency table!
The unconditional distribution of
income and education
• We find that the mean and the standard deviation of the variables
are as follows:
– Annual income y: mean $41,550; SD $48,659
– Years of schooling x: mean 12.25 years; SD 1.6 years
• Assuming a bell-shaped distribution (the empirical rule):
– Nearly all earnings will fall between: mean ± 3 sd
– About 95% of earnings will fall between: mean ± 2 sd
– About 68% of earnings will fall between: mean ± 1 sd
• Interesting: could do a risk analysis with that data:
– What is the probability that you earn more than mean + 2 sd?
• But the unconditional distribution of annual income mixes both
individuals with high and low levels of education…
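The empirical-rule intervals and the risk question above can be sketched as follows. This assumes, as the slide does, a bell-shaped (normal) distribution with the mean and SD reported for annual income; the variable names are illustrative.

```python
from statistics import NormalDist

# Values reported on the slide for the unconditional distribution of income.
mean_income = 41_550
sd_income = 48_659

# Empirical-rule intervals: mean +- k sd.
for k, share in [(1, "about 68%"), (2, "about 95%"), (3, "nearly all")]:
    low = mean_income - k * sd_income
    high = mean_income + k * sd_income
    print(f"{share} of earnings between {low:,} and {high:,}")

# Probability of earning more than mean + 2 sd, under the normal assumption.
income_dist = NormalDist(mu=mean_income, sigma=sd_income)
p_above_2sd = 1 - income_dist.cdf(mean_income + 2 * sd_income)
print(f"P(income > mean + 2 sd) = {p_above_2sd:.4f}")
```

Under the normal assumption this probability is roughly 2.3%. The lower bounds come out negative, a sign that the bell-shaped assumption is only a rough approximation for income.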
The conditional distribution of
income and education
• So instead of using the unconditional distribution of income (aka marginal
distribution), we use the conditional distribution of income.
• “What is the distribution of income given that an individual studied for x
years?”
[Figure: earnings (y) plotted against education (x).]
Understanding the
mean of income given x
• After x years of education, the predicted
(mean) annual income will be:
a + bx = -123.610 + 2,689.936 x
• With x = 16 …. We find $42,915 !
• Good or bad?
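The prediction is a one-line computation. A minimal sketch, using the estimated intercept and slope shown above (the function name `predicted_income` is illustrative):

```python
# Estimated intercept and slope from the slide.
a = -123.610
b = 2689.936


def predicted_income(years_of_schooling):
    """Predicted (mean) annual income in dollars, given years of schooling."""
    return a + b * years_of_schooling


prediction_16 = predicted_income(16)
print(round(prediction_16))  # 42915
```

Because the model is linear, each additional year of schooling adds the same amount, b = $2,689.936, to the prediction.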
Understanding the risks:
Approach #1
• Use the fact that TSS = ESS + SSE.
– The ESS measures how education explains the variance of earnings.
• From this we find that Var(y) = Var(predictions) + Var(error).
– How do we go from TSS=ESS+SSE to this?
• But that is the variance of the unconditional distribution of y.
• How can we find the variance of earnings given a level of education?
• In such a case Var(y) given a level of education is Var(y given x)=Var(error).
• And thus the standard deviation of earnings given a level of education is:
– SD(residuals) = square root of (SSE/N) = sqrt(513012622113699.1/1460042) = $18,744
• Applying the empirical rule… we find that most annual incomes will lie
between:
$42,915 - 3 x $18,744 and $42,915 + 3 x $18,744,
that is, between $0 (the computed lower bound, -$13,317, is negative, so we truncate it at zero) and $99,147.
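Approach #1 can be reproduced from the SSE and N reported above. A minimal sketch; truncating the lower bound at zero is an assumption made here because incomes cannot be negative.

```python
import math

# Values reported on the slide.
sse = 513012622113699.1   # sum of squared errors
n = 1460042               # number of observations
predicted_mean = 42915    # predicted income at x = 16 years of schooling

# Standard deviation of earnings given education = SD of the residuals.
sd_residuals = math.sqrt(sse / n)

# Empirical rule: most incomes lie within 3 standard deviations of the mean.
low = max(0.0, predicted_mean - 3 * sd_residuals)  # truncated at zero (assumption)
high = predicted_mean + 3 * sd_residuals

print(f"SD(residuals) = {sd_residuals:,.0f}")
print(f"Most incomes between {low:,.0f} and {high:,.0f}")
```

The small difference from the hand computation comes from rounding the standard deviation to $18,744 before multiplying by 3.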
Understanding the risks:
Approach #2
• Use our beautiful formula:
\[ r(x, y) = \left( \frac{s_x}{s_y} \right) b \]
– r(x, y): correlation.
– s_x: standard deviation of x. Here 1.6.
– b: slope. Here $2,689.936.
– s_y: standard deviation of y. Here 19,233.75.
• Hence the correlation between earnings and education is: 0.2240
• It is lower than 1 because the linear relationship doesn’t hold
exactly.
• The r² is thus: 0.050176
• Notice the variance of the error: Var(error) = (1 - R²) x Var(y)
• And thus: sd(residuals) = sqrt(1 - R²) x SD(y)
• We find: $18,744 !!! Same as before !
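Approach #2 can be sketched the same way, using only the slope and the two standard deviations quoted on this slide. Variable names are illustrative, and the result matches Approach #1 only up to rounding of the inputs.

```python
import math

# Values quoted on the slide.
s_x = 1.6          # standard deviation of years of schooling
s_y = 19233.75     # standard deviation of income used on this slide
b = 2689.936       # estimated slope

# Correlation from the slope: r = (s_x / s_y) * b.
r = (s_x / s_y) * b
r_squared = r ** 2

# Var(error) = (1 - R^2) * Var(y), so sd(residuals) = sqrt(1 - R^2) * SD(y).
sd_residuals = math.sqrt(1 - r_squared) * s_y

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
print(f"sd(residuals) = {sd_residuals:,.0f}")
```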
The Empirical Rule
[Figure: frequency curves of earnings for the unconditional distribution and the conditional distribution.]
The conditional distribution has a lower standard deviation and a higher mean than the unconditional distribution.
Where will your earnings lie with 95% probability?
Wrap up
• With a linear relationship y = a + b x + e:
– The unconditional (i.e. marginal) distribution of y has a larger variance than
the conditional distribution of y given x.
• The mean of the conditional distribution of y given x is
a+bx
• And the standard deviation is the standard deviation of the
errors e_i.
• Such standard deviation is equal to:
\[ \sqrt{\frac{SSE}{N}} \]
Again, N in the
denominator.
Proper discussion of
this to follow.
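As a sketch of how the decomposition of sums of squares turns into a statement about variances, divide TSS = ESS + SSE by N. The notation \hat{y}_i for the predicted value a + b x_i and e_i for the residual is introduced here for illustration. The steps assume a least-squares line with an intercept, so that the predictions have the same mean as y and the residuals have mean zero.

```latex
\begin{align*}
\text{TSS} &= \text{ESS} + \text{SSE} \\
\frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})^2
  &= \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2
   + \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \\
\operatorname{Var}(y) &= \operatorname{Var}(\hat{y}) + \operatorname{Var}(e) \\
\operatorname{Var}(y \mid x) &= \operatorname{Var}(e) = \frac{\text{SSE}}{N},
\qquad
\operatorname{SD}(y \mid x) = \sqrt{\frac{\text{SSE}}{N}}
\end{align*}
```

The last line uses the fact that, once x is fixed, the prediction a + bx is a constant, so the only remaining variation in y comes from the error.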
Outline
1. Conditional distribution
– What wage will I earn after graduation?
2. Probabilities (Chapter 4)
After the Break:
Probability Distributions
Chapter 4 of A&F
Probability and Luck
• We play a game together…
– Heads you win 1 dirham.
– Tails I win 1 dirham.
• We play the game a very large
number of times.
• Should you play this game?
• P(heads) = 0.5, P(tails) = 0.5
Probability and Luck
• P(heads) = 1 – P(not heads)
• P(heads) is read as “probability of heads”.
• Game sequence:
– In the long run, with a balanced coin, 0.5 of the trials will lead to
heads, 0.5 of the trials will lead to tails.
– The probability of heads is the ratio of the number of heads to the
number of trials, with an infinite number of draws…
\[ P(\text{heads}) = \frac{\text{Number of heads}}{\text{Number of draws}} \]
Perform the game for a very large number of draws.
… the longer the game, the closer the ratio will be to 0.5.
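The long-run idea can be tried out with a simulated coin. A minimal sketch, assuming a balanced coin and a fixed random seed so the run is reproducible (the function name `heads_ratio` is illustrative):

```python
import random

random.seed(9)  # fixed seed: the same sequence of draws on every run


def heads_ratio(n_draws):
    """Flip a balanced coin n_draws times; return (number of heads) / (number of draws)."""
    heads = sum(1 for _ in range(n_draws) if random.random() < 0.5)
    return heads / n_draws


# The ratio for short, medium, and long games.
ratios = {n: heads_ratio(n) for n in (10, 1_000, 100_000)}
for n, ratio in ratios.items():
    print(f"{n:>7} draws: ratio of heads = {ratio:.4f}")
```

With only 10 draws the ratio can be far from 0.5; with 100,000 draws it is typically very close to it.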
Probability and Luck
• What is the probability that you win twice in a
row?
– P(heads in the first round)
* P(heads in the second round) = 0.5 * 0.5 = 0.25
– Because the draws in the first and the second
round are independent events.
• What is the probability that you win k times in
a row?
– P(heads in the first round)
* P(heads in the second round)
* …. * P(heads in the kth round) = 0.5^k
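Because the rounds are independent, the probability of a winning streak is a product of identical terms. A minimal sketch (the function name `prob_win_streak` is illustrative):

```python
def prob_win_streak(k, p_heads=0.5):
    """Probability of k heads in a row with independent draws."""
    return p_heads ** k


for k in (1, 2, 5, 10):
    print(f"P(win {k} times in a row) = {prob_win_streak(k)}")
```

The probability halves with every additional round, so long streaks quickly become very unlikely.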
Sometimes we can’t
repeat our choices
Life is full of random events… but
• We only draw one job at the end of university.
– Hard to know what other incomes/jobs we would
have gotten.
• We only draw one marriage.
– Subsequent marriages are not identical to the first
one.
– What is the probability of divorce?
• We only die once at a particular age.
– What is the probability of death at age 50?
Sometimes we can’t
repeat our choices
• In such a case we define the probability of an
event as the ratio of the number of such
events over the number of individuals in
identical circumstances.
– … for a very large number of such individuals.
• Example: number of individuals with the same
degree, same age as me:
• What is the probability of earning more than
$45,000 in my first job?
\[ P(\text{earning} \geq \$45{,}000) = \frac{\text{Number of individuals earning more than } \$45{,}000}{\text{Number of individuals identical to me}} \]
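This ratio is simply a proportion in a group of comparable people. A minimal sketch with made-up first-job incomes (the list `first_job_incomes` is hypothetical, and far smaller than the very large group the definition calls for):

```python
# Hypothetical first-job incomes of individuals with the same degree and age.
first_job_incomes = [31000, 38000, 45500, 52000, 29000,
                     47000, 61000, 40000, 44000, 58000]
threshold = 45000

# Count the individuals earning more than the threshold.
n_above = sum(1 for income in first_job_incomes if income > threshold)

# Probability = (number of such events) / (number of comparable individuals).
p_above = n_above / len(first_job_incomes)
print(f"P(earning more than ${threshold:,}) = {p_above}")  # 0.5
```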
Wrap Up
• What is the conditional distribution of y given x?
– Use the relationship y = a + b x + e to find the mean of y
given x.
• We compute a and b using our formulas.
– Use the relationship TSS = ESS + SSE:
• the variance of the error is the variance of the y minus the
variance of the prediction.
– The standard deviation of y given x is the standard
deviation of the errors (residuals).
– Apply the empirical rule.
• 95% of the y given x will lie between a + b x ± 2 sd(y given x)
• Beginning probability distributions (chapter 4)
Coming up:
Don’t forget:
• Break of Statistics for 2 weeks.
• Only one week break for recitations.
For help:
• Amine Ouazad
Office 1135, Social Science building
amine.ouazad@nyu.edu
Office hour: Wednesday from 4 to 6pm.
• GAF: Irene Paneda
Irene.paneda@nyu.edu
Sunday recitations.
At the Academic Resource Center, Monday from 2 to 4pm.