Alternatively, dependent variable and independent variable.
Alternatively, endogenous variable and exogenous variable.

Association versus causation

Scatterplots
[Figure: scatterplot. Axis text as extracted: "Weeks since beginning of semester"; "Percentage of computers used"; "free in computer labs"]

Stata Exercise 1
Suppose we were considering the effect of hiring more people into the firm. On average, what total billings can we expect from a staff of 50? 150?

Stata Exercise 2

Stata Exercise 3

Stata Exercise 4

Adding Categorical Values to a Scatterplot
Often it is useful to have a way of distinguishing groups of data in a scatterplot.

Stata Exercise 5

Stata Exercise 6

Transforming Data
Data analysts often look for a transformation of the data that simplifies the overall pattern.
The transformation typically involves turning a non-Normally distributed variable into a more-or-less Normally distributed variable.

Stata Exercise 7

Categorical Explanatory Variable
What if the explanation for the numbers is not another number but the category? For example, investing in a particular sector of the economy might be great in some years or terrible in others.

Stata Exercise 8

More scatterplots
Relations between competitors

Stata Exercise 9

Correlation
Which one has the stronger correlation?

r = covariance(x,y) / [stdev(x)*stdev(y)]
r = (1/(n-1)) * sum of [(standardized values of x) * (standardized values of y)]

Worksheet (w = week, p = prop of comps). Columns to fill in for each row: w - mean of w; z-score of w; p - mean of p; z-score of p; z-score * z-score.

week    prop of comps
1       73.1
2       89.7
3       71.3
4       65.3
5       54.6
6       57.9
7       51.6
8       41.2
9       59.1
10      48.5
11      24
12      43
13      29.1
14      19.7
15      12.1
16      10.1

mean of w: 8.5     stdev of w: 4.8
mean of p: 46.9    stdev of p: 23.1
sum: 0.00    count: 16    corr: (blank)

Correlation
r = (1/(n-1)) * sum of [(standardized values of x) * (standardized values of y)]
The r coefficient between measures of height and weight is positive because people who are of above-average height tend to be of above-average weight, so if the z-score for height is large, the z-score for weight tends to be large.
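The worksheet above can be filled in by hand or with a few lines of code. The following is a minimal Python sketch (not one of the Stata exercises) that computes r both ways the slides define it. It assumes that the k-th percentage in the worksheet belongs to week k, which is a guess about how the flattened table was laid out; the function names are illustrative only.

```python
import math

# Worksheet values from the slide. Pairing week k with the k-th percentage
# is an assumption about the original table layout.
weeks = list(range(1, 17))
pct = [73.1, 89.7, 71.3, 65.3, 54.6, 57.9, 51.6, 41.2,
       59.1, 48.5, 24.0, 43.0, 29.1, 19.7, 12.1, 10.1]


def mean(values):
    return sum(values) / len(values)


def stdev(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def z_scores(values):
    """Standardized values: (value - mean) / stdev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(x, y):
    """r = (1/(n-1)) * sum of (z-score of x) * (z-score of y)."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def correlation_from_covariance(x, y):
    """r = covariance(x, y) / [stdev(x) * stdev(y)]."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


r = correlation(weeks, pct)
print(f"mean of w = {mean(weeks):.1f}, stdev of w = {stdev(weeks):.1f}")
print(f"mean of p = {mean(pct):.1f}, stdev of p = {stdev(pct):.1f}")
print(f"r (z-score form)     = {r:.4f}")
print(f"r (covariance form)  = {correlation_from_covariance(weeks, pct):.4f}")
```

The two forms agree because dividing each deviation by its standard deviation before multiplying is the same as dividing the covariance by the product of the standard deviations. Under the pairing assumed here, r comes out negative: the percentage falls as the weeks go by.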
Correlation applet at www.whfreeman.com/pbs

Stata Exercise 11

Correlation
Correlation coefficients, as well as scatterplots, can be used for comparisons. For example, how well did Vanguard International Growth Fund (an investment vehicle) do compared to an average of the stocks in Europe, Australasia and the Far East?

Stata Exercise 12

Correlation
Doesn’t tell you anything about causality
Variables must be numerical
It is indifferent to units of measurement
r > 0 means positive association; r < 0, negative
-1 ≤ r ≤ 1. r = -1 means a perfectly straight downward-sloping line. r = 0 means no linear relation.
r only measures linear relations
r is not resistant to outliers

Stata Exercise 13

Regression

The Linear Regression Model
$y_i = a + b x_i + \text{error}_i$
Errors have a mean of 0 and a constant sd of s, and are independent of x.

[Figure: scatterplot of Price of Homes against Square Footage of Homes, with linear prediction]
[Figure: scatterplot of Price of Homes against Square Footage of Homes, with linear prediction]
[Figure: histograms (Frequency) of Price of Homes, one panel per size band: 1000<sqft<=1500, 1500<sqft<=2000, 2000<sqft<=2500, 2500<sqft<=3000, 3000<sqft<=3500, 3500<sqft<=4000]
[Figure: scatterplot of Price of Homes against Square Footage of Homes, with linear prediction]

$y - 20{,}000 = 1560\,(x - 66.5)$
Sketch a scatterplot of the data consistent with this line.

$y \approx -84{,}000 + 1560\,x$ (exactly, the intercept is -83,740)
[Figure: earn against Height (inches), with fitted values. Marked points: (61.5’’, $12,200), (66.5’’, $20,000), (76.5’’, $35,600). Annotations: "$37,694"; "95% of values"]
[Figure: earn against Height (inches), with fitted values]

[Figure: scatterplot of y against x, points drawn as circles]
Draw the best-fitting line through the circles.

[Figure: scatterplot of y against x, points drawn as circles]
Draw the best-fitting line through the circles.

[Figure: scatterplot of y against x]
Mark with an “X” the average “y” value for each “x” value.
Then draw the best-fitting line through the Xs.

[Figure: scatterplot of y against x]
Mark with an “X” the average “y” value for each “x” value. Then draw the best-fitting line through the Xs.

Fact 1
Regression (unlike correlation) is sensitive to your determination of which variable is explanatory and which is the response.
Item = a + b(sales)
Sales = a + b(item)

Stata Exercise 14

Facts 2 and 3
If x changes by one standard deviation of x, the predicted y changes by r standard deviations of y.
  – E.g., $s_x = 1$, $s_y = 2$, and $r = 0.61$. If x changes by 1, the predicted y will change by 2*0.61 = 1.22.
The regression line goes through the point $(\bar{x}, \bar{y})$.
  – The point-slope form of the line requires only the information on this slide to draw a line.

Fact 4
Correlation r is related to the slope of the regression line and therefore to the relation between x and y. Actually, the square of r, that is, $R^2$, is the fraction of the variation in y that is explained by the variation in x.

$R^2 = \dfrac{\text{variation in } \hat{y} \text{ as } x \text{ pulls it along the line}}{\text{total variation in observed values of } y}$

Because most of the variation in gas consumption is explained by temperature, the $R^2$ of this regression is very high.

tbill98    tbill98_hat    residuals
11.5       10.84649
12.6       12.19961
13.8       14.81564
6.4        5.975251
5.3        6.336083

Excel Exercise 1

Stata Exercises 15 and 16
[Figure: regression with influential observations]
[Figure: regression without influential observation]

Stata Exercise 17

Cautions about Correlation and Regression
Don’t extrapolate too far
Correlations are stronger for averages than for individuals
Beware of lurking (latent, hidden, excluded, neglected) variables
Association is not causation
  – Establishing causation takes a lot of work (see p. 139).
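Facts 2 and 3 and the residuals column of the tbill98 table can be checked with a short calculation. The sketch below is plain Python, not part of the Stata or Excel exercises, and its function names are made up for illustration. It treats (66.5, 20,000) from the height and earnings slide as the point of means, which the slides do not state outright, and it takes a residual to be the observed value minus the fitted value.

```python
def slope_from_correlation(r, sx, sy):
    """Fact 2: a one-sd change in x moves the predicted y by r sds of y,
    so the slope per unit of x is r * sy / sx."""
    return r * sy / sx


def line_through_point(slope, x0, y0):
    """Fact 3 in point-slope form: y - y0 = slope * (x - x0).

    Returns the intercept and a prediction function.
    """
    intercept = y0 - slope * x0

    def predict(x):
        return intercept + slope * x

    return intercept, predict


# Slide example: sx = 1, sy = 2, r = 0.61.
b = slope_from_correlation(r=0.61, sx=1.0, sy=2.0)
print(f"change in predicted y per unit of x: {b:.2f}")

# Height/earnings line from the slides: slope 1560 through (66.5, 20000).
# Reading that point as (mean height, mean earnings) is an assumption.
intercept, predict = line_through_point(slope=1560.0, x0=66.5, y0=20000.0)
print(f"intercept: {intercept:,.0f}")
for height in (61.5, 66.5, 76.5):
    print(f"predicted earnings at {height} inches: {predict(height):,.0f}")

# Residual = observed - fitted, for the five tbill98 rows shown on the slide.
observed = [11.5, 12.6, 13.8, 6.4, 5.3]
fitted = [10.84649, 12.19961, 14.81564, 5.975251, 6.336083]
residuals = [o - f for o, f in zip(observed, fitted)]
for o, f, e in zip(observed, fitted, residuals):
    print(f"tbill98 = {o:5.1f}  tbill98_hat = {f:9.6f}  residual = {e:+.6f}")
```

A positive residual means the observation sits above the fitted line and a negative one means it sits below. The five rows shown are presumably only part of the data behind the regression, so their residuals need not sum to zero.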