Statistics for Social and Behavioral Sciences
Session #6: The Regression Line, Cont’d (Agresti and Finlay, Chapter 9)
Prof. Amine Ouazad

Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN (Week 1)
PART II. DESCRIBING DATA (Weeks 2-4)
Where we are right now! Describing associations between two variables.
PART III. DRAWING CONCLUSIONS FROM DATA: INFERENTIAL STATISTICS (Weeks 5-9)
Firenze or Lebanese Express?
PART IV. CORRELATION AND CAUSATION: REGRESSION ANALYSIS (Weeks 10-14)
This is where we talk about ZMapp and Ebola!

Last Session
• From a scatter plot to a linear relationship
– A linear relationship is a model, and an imperfect one.
– A linear relationship implies a constant gradient.
– A linear relationship helps predict (extrapolate) and interpolate to fill in missing statistics.
• Finding the regression line
– The regression line minimizes the sum of squared errors.
– The formulas for a and b are essential to learn.

Outline
1. The Regression Line (Cont’d)
– Last time’s recap
– Why we call it regression
2. Warning: Correlation is not causation
– Spurious relationships
– Being agnostic about causality: correlation
3. How well does the linear model perform?
Next session: Bivariate analysis, Chapter 9 of A&F, continued

Finding the regression line
• Which line is the right one?
• A line is entirely determined by the choice of a and b.
An essential formula:
$$ b = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x} $$
x is the explanatory variable; y is the response variable.
Notice the difference between β and b, between α and a (β and α are the parameters of the linear model; b and a are the values computed from the data).
If y increases when x increases, then b > 0.
If y decreases when x increases, then b < 0.

Why do we call this regression?
Sir F. Galton
• “Regression towards mediocrity in Hereditary Stature”, Sir Francis Galton, 1886.
What are y, x, b here?

Understanding Galton: Questions
• A little exercise to understand Sir Francis:
1. What is the data? How many observations? What is y? What is x?
2. Write the assumed linear relationship between y and x.
3. Can you express the mean of y (as a function of the mean of x)?
4. Take the difference between child i’s height and the children’s mean height.
5. How does it relate to the difference between child i’s parents’ midheight and the mean of the parents’ midheights?
I use mean and average interchangeably in this course: same formula.

“More than a fifth of people on unemployment benefits have a criminal record, government figures have revealed. The new data showed an estimated 22 per cent of all people claiming out of work claimants - such as Jobseeker’s Allowance - were made by people who had been to prison or convicted of an offence in the previous 12 years.”
Chris Grayling, the Justice Secretary, is pushing through reforms which aim to provide more support to offenders who are released from jail back into the community.
Jeremy Wright, the justice minister, said: “We are committed to delivering long-needed changes that will see all offenders released from prison receive targeted support to finally turn themselves around and start contributing to society.”

Unemployment and Crime
“The figures also showed 44 per cent of offenders were claiming benefits a month after being convicted, cautioned or released from jail.”
“More than half of offenders - 54 per cent - released from prison were claiming out-of-work benefits one month later, gradually decreasing to 42 per cent two years after.”
“In all, 214,000 people claiming out-of-work benefits had been to prison at least once in the previous 12 years, or 4 per cent of the total.”
“Previous data published in 2011 estimated the proportion of criminal claimants was slightly higher, at 26 per cent, but a Ministry of Justice spokesman said the sets of figures were not directly comparable.”
Chris Grayling, Justice Secretary (UK)

Association is not causation
What Drives Obesity?
Is higher obesity due to the rise in driving? Perhaps. It’s an intriguing hypothesis. But our friends at The Economist should know better than to report nonsensical correlations. Here’s the evidence they cite (drawn from this entirely unconvincing research paper published in Transport Policy):
Looks impressive, right? (Well, apart from putting the explanatory variable on the vertical axis.) But before concluding that there’s anything here, let’s try a different variable instead, my age:

Reading is an important skill, and elementary school teachers have observed that the reading ability of their students tends to increase with their shoe size. To help boost reading skills, should policymakers offer prizes to scientists to devise methods to increase the shoe size of elementary school children? Obviously, the tendency for shoe size and reading ability to increase together does not mean that big feet cause improvements in reading skills.
Older children have bigger feet, but they also have more developed brains. This natural development of children explains the simple observation that shoe size and reading ability have a tendency to increase together; that is, they are positively correlated. But clearly there is no causal relationship: bigger shoe size does not cause better reading ability.
In economics, correlations are common. But identifying whether the correlation between two or more variables represents a causal relationship is rarely so easy. Countries that trade more with the rest of the world also have higher income levels, but does this mean that trade raises income levels? People with more education tend to have higher earnings, but does this imply that education results in higher earnings?
Knowing precise answers to these questions is important. If additional years of schooling caused higher earnings, then policymakers could reduce poverty by providing more funding for education. If an extra year of education resulted in a $20,000 a year increase in earnings, then the benefits of spending on education would be a lot larger than if an extra year of education caused only a $2 a year increase.
Economists need statisticians

Association is not causation
• The response variable may be the explanatory variable and vice versa (reverse causation).
• There may be other factors that affect the response variable, other than the explanatory variable.
☞ Multivariate statistics coming up in week 12.

Univariate statistics (Weeks 1 and 2)
– Inspecting the distribution of one variable.
– Am I taller than the average? Than the median? What percentile of the distribution do I belong to?
Bivariate statistics (Now and next week)
– Discovering associations between 2 variables.
– What is the relationship between parents’ height and children’s height? What is the relationship between unemployment and crime?
Multivariate statistics (Week 12)
– Uncovering causality: looking at the impact of multiple explanatory variables on one response variable.
– What factors cause crime? Poverty, unemployment, guns, police headcounts?

The correlation of two variables
• The correlation of two variables is:
$$ r(x, y) = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{s_x \, s_y} $$
A sum over N observations: fortunately, a computer will usually do it (Stata).
• The correlation does not make an assumption about the direction of causality (the slope does).
• It is, however, related to the slope:
$$ r(x, y) = \left( \frac{s_x}{s_y} \right) b $$
where r(x, y) is the correlation, s_x is the standard deviation of x, s_y is the standard deviation of y, and b is the slope.

An Example: Unemployment and Murders – The Sequel
• Standard deviation of Unemployed Persons: 5,901.259
• Standard deviation of Murders: 20.44
• Regression line: we find b = 0.00285 and a = -1.96
• The correlation r(Unemployed, Murders) is: 0.83.
• Self-check?

Properties of the correlation
• The correlation is a number between -1 and 1, sometimes (but rarely) expressed as a percentage.
• If two variables have a correlation of 1, we say that they are perfectly correlated…
– Example: student expenses in USD are perfectly correlated with student expenses in AED.
– y is exactly a + b x, with b > 0.
• If two variables have a correlation of -1, the two variables are exactly such that y = a + b x, with b < 0.
– Example: Number of days to New Year’s Eve, number of days from New Year’s Eve.

How “good” are our predictions?
Ouch: we make errors.
The actual $y_i$ and the predicted $y_i$, noted $\hat{y}_i$.
The regression line minimizes the sum of the squared errors:
$$ \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$
Remember the formula for b and a.
When does a model predict y perfectly?
When does the model have no predictive power?
(Figure: the regression line.)
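The formulas for a, b and r can be checked numerically. The sketch below is only an illustration: the five data points are invented (they are not the course dataset), the helper names are hypothetical, and the standard deviation uses the 1/N convention of the correlation formula. It also runs the self-check from the Unemployment and Murders example using the three numbers reported on the slide.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def std(values):
    """Standard deviation with the 1/N convention (an assumption of this sketch)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def regression_line(x, y):
    """Return (a, b) for the line y = a + b x that minimizes the sum of squared errors."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return a, b


def correlation(x, y):
    """Correlation r(x, y): average product of deviations divided by s_x * s_y."""
    x_bar, y_bar = mean(x), mean(y)
    avg_product = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)
    return avg_product / (std(x) * std(y))


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = regression_line(x, y)
r = correlation(x, y)
r_from_slope = (std(x) / std(y)) * b  # r = (s_x / s_y) * b

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"r = {r:.4f}, (s_x / s_y) * b = {r_from_slope:.4f}")

# Self-check with the slide's numbers: s_x = 5901.259, s_y = 20.44, b = 0.00285.
r_slide = (5901.259 / 20.44) * 0.00285
print(f"slide self-check: r = {r_slide:.3f}")
```

On the invented data the two routes to r agree, as the slope relation says they should. With the slide's numbers the self-check gives roughly 0.82, in line with the reported 0.83 once you allow for b being rounded to three significant digits.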
Playing with the R Squared
• The R Squared is:
$$ R^2 = \frac{\text{variance of the predicted } y}{\text{variance of the actual } y} $$
• Answers the question(s):
– “What fraction of the variance of the response variable is explained by the explanatory variable?”
– “What percentage of the variance of the response variable is explained by the explanatory variable?”
• Measures the fit of the linear model.
• The R Squared is also the square of the correlation between x and y! $R^2 = r^2$

An Example: Unemployment and Murders – The Sequel
• The variance of the predicted number of murders is: 284.3
• The variance of the actual number of murders is: 417.8
• The R Squared is: 284.3 / 417.8 ≈ 0.68. Not bad!
• Side question: what is the variance of the errors (residuals)?
Remember: variance(y) = variance(prediction) + variance(error)
Follow my lead, it’s easier.

Wrap up
• Finding the regression line (Sir Galton)
– The regression line minimizes the sum of squared errors.
– The formulas for a and b are essential.
• Association is not causation
– Does x cause y or does y cause x?
– Is there any other factor that may cause y?
– Being agnostic about the direction of causality: the correlation r.
• How good are my predictions? How good is my model?
– Use the R Squared, know its formula.
– The variance is the square of the standard deviation.
Next session: Minority Report continues

Don’t forget:
• Midterm 1 coming up in week 5 (exact date coming soon from the Registrar Mary Downes).
• Online Quiz #3 starting tonight at 9pm, due Tuesday at 9am.
• Sunday recitation on: “The Regression Line: ‘Education and Economic Growth.’”
• In chapter 9, read everything except Section 9.5 (Inferences for the Slope).

For help:
• Amine Ouazad
Office 1135, Social Science building
amine.ouazad@nyu.edu
Office hour: Wednesday from 4 to 5pm.
• GAF: Irene Paneda
Irene.paneda@nyu.edu
Sunday recitations. At the Academic Resource Center, Monday from 2 to 4pm.
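As a supplement to the R Squared slides, here is a small numerical sketch of the variance decomposition. The data are invented for illustration (they are not the course dataset), the helper names are hypothetical, and variances use the 1/N convention. The last lines apply the same arithmetic to the two variances reported in the Unemployment and Murders example.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Variance with the 1/N convention (an assumption of this sketch)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Regression line y = a + b x (least squares).
x_bar, y_bar = mean(x), mean(y)
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

predicted = [a + b * xi for xi in x]
errors = [yi - pi for yi, pi in zip(y, predicted)]

# R Squared: share of the variance of y explained by the prediction.
r_squared = variance(predicted) / variance(y)

print(f"variance(y)          = {variance(y):.3f}")
print(f"variance(prediction) = {variance(predicted):.3f}")
print(f"variance(error)      = {variance(errors):.3f}")
print(f"R squared            = {r_squared:.3f}")

# Same arithmetic with the two variances reported on the slide.
r2_slide = 284.3 / 417.8
residual_variance_slide = 417.8 - 284.3
print(f"slide: R squared = {r2_slide:.3f}, "
      f"variance of errors = {residual_variance_slide:.1f}")
```

On the invented data the variance of y splits exactly into the variance of the prediction plus the variance of the errors. For the slide's numbers this gives an R Squared of about 0.68 and answers the side question: the variance of the errors is 417.8 - 284.3 = 133.5.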