Statistics for Social and Behavioral Sciences
Session #9: Linear Regression and Conditional Distribution, Probabilities
(Agresti and Finlay, Chapter 9)
Prof. Amine Ouazad

Statistics Course Outline
• PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)
• PART II. DESCRIBING DATA (Weeks 2-4)
  – Where we are right now! Describing associations between two variables.
• PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)
  – Firenze or Lebanese Express?
• PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14)
  – This is where we talk about Zmapp and Ebola!

Last session
• How good are my predictions? How good is my model?
  – Use the R squared = ESS/TSS.
  – TSS = ESS + SSE.
  – The notations TSS, ESS, SSE are widespread.
  – The variance is the square of the standard deviation.
  – The R squared is also the square of the correlation between the predicted value and the actual value.

Outline
1. Conditional distribution
  – What wage will I earn after graduation?
2. Probabilities (Chapter 4)
After the Break: Probability Distributions, Chapter 4 of A&F

When LaTisha Styles graduated from Kennesaw State University in Georgia in 2006 she had $35,000 of student debt. This obligation would have been easy to discharge if her Spanish degree had helped her land a well-paid job. But there is no shortage of Spanish-speakers in a nation that borders Latin America. So Ms Styles found herself working in a clothes shop and a fast-food restaurant for no more than $11 an hour. Frustrated, she took the gutsy decision to go back to the same college and study something more pragmatic. She majored in finance, and now has a good job at an investment consulting firm. Her debt has swollen to $65,000, but she will have little trouble paying it off.

A Contingency Table (From Previous Session)
• We will learn how to produce this later in the course. For now, let’s interpret and understand it.
• It shows the average weekly earnings for each year of education.
• But can I do a regression analysis here?
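A table of average earnings for each year of education is a table of group means. The sketch below shows one way such a table could be computed; the data pairs and the function name `mean_earnings_by_education` are invented for illustration and are not the census figures or the course's own procedure.

```python
from collections import defaultdict

# Hypothetical (years of education, weekly earnings in dollars) pairs,
# invented for illustration only; these are NOT the census figures.
sample = [
    (12, 500.0), (12, 620.0), (12, 560.0),
    (14, 700.0), (14, 760.0),
    (16, 900.0), (16, 1100.0), (16, 1000.0),
]


def mean_earnings_by_education(pairs):
    """Return {years of education: average earnings} for the given pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for years, earnings in pairs:
        totals[years] += earnings
        counts[years] += 1
    # One row per distinct value of x, sorted by years of education.
    return {years: totals[years] / counts[years] for years in sorted(totals)}


table = mean_earnings_by_education(sample)
for years, avg in table.items():
    print(f"{years} years of education: average weekly earnings ${avg:,.2f}")
```

Each row is the mean of y among individuals sharing the same x, which only works because years of education takes a small number of distinct values.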
What wage will I earn after graduation?
• Data: Census of Population 2010.
• The United States Census is a decennial census mandated by Article I, Section 2 of the United States Constitution, which states: “Representatives and direct Taxes shall be apportioned among the several States ... according to their respective Numbers ... . The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years.”
• Variables:
  – Number of years of education completed.
  – Wage income.
• We can only perform regression analysis on quantitative variables.

Linear Relationship anybody?
• We can postulate that there is a linear relationship between wage income (y) and years of schooling (x):
  y = α + βx + ε
• Using Greek letters here: this is the true relationship in the population.
• Notice the importance of the residuals (aka errors) ε.
• Units of measurement matter. Make sure you read the fine print.
  – y is annual income in dollars. x is in years.
• Also, with a linear relationship, an additional year of education leads to the same increase in income at any stage of your education.
  – Makes sense? Check the contingency table.
• We keep the linear relationship as a convenient model.

Estimation of α and β
• We estimate α and β by computing the values a and b from the data. We only have a sample, not the entire population.
• So, what earnings can we expect?

Linear? A contingency table
[Figure: fitted line with wage income (y) on the vertical axis and years of schooling (x) on the horizontal axis; a is the intercept and b is the rise in y for one additional year of schooling.]

(… Continued …)
• Here, years of schooling (x) is quantitative discrete, so we can do both regression analysis and a contingency table!
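To make the estimation step concrete, here is a minimal sketch that computes a and b with the standard least-squares formulas (b is the sum of cross-deviations divided by the sum of squared deviations of x, and a is the mean of y minus b times the mean of x). These are assumed to be the formulas the course refers to. The five data points and the function name `fit_line` are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope: change in y per additional unit of x
    a = y_bar - b * x_bar    # intercept: the line passes through the means
    return a, b


# Hypothetical sample (years of schooling, annual income in dollars),
# invented for illustration only; these are NOT the census figures.
years = [10, 12, 14, 16, 18]
income = [22000.0, 30000.0, 33000.0, 41000.0, 44000.0]

a, b = fit_line(years, income)
predictions = [a + b * x for x in years]
residuals = [y - p for y, p in zip(income, predictions)]

# Sums of squares, to check TSS = ESS + SSE on this sample.
y_bar = sum(income) / len(income)
tss = sum((y - y_bar) ** 2 for y in income)
ess = sum((p - y_bar) ** 2 for p in predictions)
sse = sum(e ** 2 for e in residuals)

print(f"a = {a:.1f}, b = {b:.1f}")
print(f"predicted income after 16 years: {a + b * 16:.1f}")
print(f"TSS = {tss:.0f}, ESS + SSE = {ess + sse:.0f}, R squared = {ess / tss:.4f}")
```

The residuals are what is left of each income once the fitted line is subtracted; they sum to zero by construction, and their squares add up to SSE.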
The unconditional distribution of income and education
• We find that the mean and the standard deviation of the variables are as follows:
  – Annual income y: mean $41,550, SD $48,659
  – Years of schooling x: mean 12.25 years, SD 1.6 years
• Assuming a bell-shaped distribution:
  – Most earnings will fall between: mean ± 3 sd
  – 95% of earnings will fall between: mean ± 2 sd
  – 68% of earnings will fall between: mean ± 1 sd
• Interesting: we could do a risk analysis with that data:
  – What is the probability that you earn more than mean + 2 sd?
• But the unconditional distribution of annual income mixes individuals with high and low levels of education…

The conditional distribution of income and education
• So instead of using the unconditional distribution of income (aka marginal distribution), we use the conditional distribution of income.
• “What is the distribution of income given that an individual studied for x years?”
[Figure: earnings (y) plotted against education (x).]

Understanding the mean of income given x
• After x years of education, the predicted (mean) annual income will be:
  a + bx = -123.610 + 2,689.936 x
• With x = 16 … we find $42,915!
• Good or bad?

Understanding the risks: Approach #1
• Use the fact that TSS = ESS + SSE.
  – The ESS measures how much of the variance of earnings is explained by education.
• From this we find that Var(y) = Var(predictions) + Var(error).
  – How do we go from TSS = ESS + SSE to this? Divide each term by N: TSS/N = Var(y), ESS/N = Var(predictions), SSE/N = Var(error).
• But that is the variance of the unconditional distribution of y.
• How can we find the variance of earnings given a level of education?
• Given a level of education, the prediction a + bx is fixed and only the error varies, so Var(y given x) = Var(error).
• And thus the standard deviation of earnings given a level of education is:
  – SD(residuals) = square root of (SSE/N) = sqrt(513012622113699.1/1460042) = $18,744
• Applying the empirical rule… we find that most annual incomes will lie between:
  $42,915 - 3 x $18,744 and $42,915 + 3 x $18,744
  that is, between $0 and $99,147 (the computed lower bound, -$13,317, is negative, so it is replaced by $0).

Understanding the risks: Approach #2
• Use our beautiful formula:
\[ r(x, y) = \left( \frac{s_x}{s_y} \right) b \]
  where r(x, y) is the correlation, s_x is the standard deviation of x (here 1.6), b is the slope ($2,689.936), and s_y is the standard deviation of y (19,233.75).
• Hence the correlation between earnings and education is: 0.2240
• It is lower than 1 because the linear relationship doesn’t hold exactly.
• The r² is thus: 0.050176
• Notice the variance of the error: Var(error) = (1 - R²) x Var(y)
• And thus: sd(residuals) = sqrt(1 - R²) x SD(y)
• We find: $18,744! Same as before!

The Empirical Rule
[Figure: frequency against earnings, showing the unconditional distribution and the conditional distribution.]
• The conditional distribution has a lower standard deviation… and a higher mean than the unconditional distribution.
• Where will your earnings lie with 95% probability?

Wrap up
• With a linear relationship y = a + b x + e:
  – The unconditional (i.e. marginal) distribution of y has a larger variance than the conditional distribution of y given x.
• The mean of the conditional distribution of y given x is a + bx.
• And the standard deviation is the standard deviation of the errors e_i.
• Such standard deviation is equal to:
\[ \sqrt{\frac{SSE}{N}} \]
  Again, N in the denominator. Proper discussion of this to follow.

Outline
1. Conditional distribution
  – What wage will I earn after graduation?
2. Probabilities (Chapter 4)
After the Break: Probability Distributions, Chapter 4 of A&F

Probability and Luck
• We play a game together…
  – Heads you win 1 dirham.
  – Tails I win 1 dirham.
• We play the game a very large number of times.
• Should you play this game?
• P(heads) = 0.5, P(tails) = 0.5

Probability and Luck
• P(heads) = 1 - P(not heads)
• P(heads) is read as “probability of heads”.
• Game sequence:
  – In the long run, with a balanced coin, 0.5 of the trials will lead to heads, 0.5 of the trials will lead to tails.
  – The probability of heads is the ratio of the number of heads to the number of trials, with an infinite number of draws…
\[ P(\text{heads}) = \frac{\text{Number of heads}}{\text{Number of draws}} \]
  – Perform the game for a very large number of draws… the longer the game, the closer the ratio will be to 0.5.

Probability and Luck
• What is the probability that you win twice in a row?
  – P(heads in the first round) * P(heads in the second round) = 0.5 * 0.5 = 0.25
  – Because the draws in the first and the second round are independent events.
• What is the probability that you win k times in a row?
  – P(heads in the first round) * P(heads in the second round) * … * P(heads in the kth round) = 0.5^k

Sometimes we can’t repeat our choices
Life is full of random events… but
• We only draw one job at the end of university.
  – Hard to know what other incomes/jobs we would have gotten.
• We only draw one marriage.
  – Subsequent marriages are not identical to the first one.
  – What is the probability of divorce?
• We only die once at a particular age.
  – What is the probability of death at age 50?

Sometimes we can’t repeat our choices
• In such a case we define the probability of an event as the ratio of the number of such events over the number of individuals in identical circumstances.
  – … for a very large number of such individuals.
• Example: number of individuals with the same degree, same age as me.
• What is the probability of earning at least $45,000 in my first job?
\[ P(\text{earning} \geq \$45{,}000) = \frac{\text{Number of individuals earning at least } \$45{,}000}{\text{Number of individuals identical to me}} \]

Wrap Up
• What is the conditional distribution of y given x?
  – Use the relationship y = a + b x + e to find the mean of y given x.
    • We compute a and b using our formulas.
  – Use the relationship TSS = ESS + SSE:
    • The variance of the error is the variance of y minus the variance of the prediction.
  – The standard deviation of y given x is the standard deviation of the errors (residuals).
  – Apply the empirical rule.
    • 95% of the y given x will lie between a + b x ± 2 sd(y given x)

Coming up:
• Beginning probability distributions (Chapter 4)

Don’t forget:
• Break of Statistics for 2 weeks.
• Only one week break for recitations.

For help:
• Amine Ouazad
  Office 1135, Social Science building
  amine.ouazad@nyu.edu
  Office hour: Wednesday from 4 to 6pm.
• GAF: Irene Paneda
  Irene.paneda@nyu.edu
  Sunday recitations. At the Academic Resource Center, Monday from 2 to 4pm.
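As a numerical check of the wrap-up, the sketch below recomputes the conditional mean and the conditional standard deviation both ways from the figures quoted in this session (a, b, SSE, N, the standard deviations of x and y). The variable names are illustrative. Because the quoted standard deviation of x is rounded to 1.6, the two approaches are expected to agree only to within a few dollars.

```python
import math

# Figures quoted in this session (variable names are illustrative).
a = -123.610              # estimated intercept
b = 2689.936              # estimated slope, dollars per year of schooling
sse = 513012622113699.1   # sum of squared errors
n = 1460042               # number of observations
sd_x = 1.6                # standard deviation of years of schooling (rounded)
sd_y = 19233.75           # standard deviation of y used in Approach #2

x = 16                    # years of schooling
mean_given_x = a + b * x  # mean of the conditional distribution of y given x

# Approach #1: standard deviation of the residuals from SSE and N.
sd_approach_1 = math.sqrt(sse / n)

# Approach #2: correlation from the slope, then sqrt(1 - r^2) * SD(y).
r = (sd_x / sd_y) * b
sd_approach_2 = math.sqrt(1 - r ** 2) * sd_y

# Empirical rule: about 95% of y given x within 2 standard deviations.
lower = mean_given_x - 2 * sd_approach_1
upper = mean_given_x + 2 * sd_approach_1

print(f"mean of y given x = {x}: {mean_given_x:,.0f}")
print(f"sd(y given x), approach 1: {sd_approach_1:,.0f}")
print(f"sd(y given x), approach 2: {sd_approach_2:,.0f}")
print(f"about 95% of incomes between {lower:,.0f} and {upper:,.0f}")
```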