AMS 5
REGRESSION

Regression

The idea behind the calculation of the coefficient of correlation is that the scatter plot of the data corresponds to a cloud that follows a straight line. This idea can be formalized by regression methods. In this class we will:

• Consider the definition of simple linear regression
• Find a method to predict an individual value
• Use the normal curve to estimate the percentile ranks
• Describe the regression effect
• Compute the regression errors and their RMS
• Study the behavior of regression errors

The regression method describes how one variable depends on another. The Northern California temperature data have an average altitude of 3,524 feet with an SD of 1,839 feet, and an average temperature of 70.3 degrees with an SD of 6.5 degrees. The correlation between temperature and altitude is −0.76.

The cloud of points shows a negative association between the two variables, as does the value of r. Can we use the values of altitude to estimate the average values of temperature?

How does the regression line work? Associated with an increase of one SD in x there is an increase of r SDs in y, on average. If the correlation coefficient is negative, the average value of y decreases as x increases. In the temperature and altitude example, an increase in altitude of 1,839 feet is associated with a change of −0.76 × 6.5 = −4.94 degrees in the average temperature, that is, a drop of about 4.9 degrees.
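
The one-SD rule above can be sketched in a few lines of Python. This is only an illustration: the function name `regression_estimate` and the constant names are assumptions introduced here, not part of the course material. The numbers are the altitude and temperature summaries quoted above, with the negative correlation.

```python
def regression_estimate(x, avg_x, sd_x, avg_y, sd_y, r):
    """Estimate the average y for a given x with the regression method."""
    z_x = (x - avg_x) / sd_x        # x in standard units
    return avg_y + r * z_x * sd_y   # y moves r SDs for each SD of x


# Summary statistics from the temperature example (hypothetical names).
AVG_ALT, SD_ALT = 3524.0, 1839.0    # altitude, feet
AVG_TEMP, SD_TEMP = 70.3, 6.5       # temperature, degrees
R = -0.76                           # negative association

# Average temperature one SD of altitude above the average altitude.
one_sd_higher = regression_estimate(
    AVG_ALT + SD_ALT, AVG_ALT, SD_ALT, AVG_TEMP, SD_TEMP, R
)
change = one_sd_higher - AVG_TEMP
print(f"change in average temperature: {change:.2f} degrees")
```

Under these assumptions the change comes out as r × SD of y, a drop of roughly 4.9 degrees for 1,839 extra feet of altitude.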

How do we use the method to predict an individual value? If we consider two variables x and y and we want to predict the value of y for a specific value of x, we use the average value of y that corresponds to that value of x according to the regression method.

Example: The first-year GPAs and the Math SAT scores for the students of a university produce the following data:

average SAT score = 550, SD = 80
average 1st-year GPA = 2.6, SD = 0.6
r = 0.40

We want to predict the 1st-year GPA of a student with an SAT score of 650.

The student's SAT score in standard units is

(650 − 550) / 80 = 1.25

so the score is 1.25 SDs above average. An increase of one SD above the average SAT score is associated with an increase of 0.4 × 0.6 = 0.24 GPA points. This implies that our student is predicted to be 1.25 × 0.4 × 0.6 = 0.3 GPA points above average. Since the average GPA is 2.6, the predicted GPA is

2.6 + 0.3 = 2.9

This is the average GPA that we expect for students with SAT scores around 650.

WARNING: You can use the regression method on new subjects provided that they are similar to the ones that were used to produce the averages, SDs and r used in the regression method. In the previous example the method will not be valid for students of a different institution.

We can use the regression method and the normal curve to produce estimates of the percentile ranks.

Example: In the previous example suppose a student has a percentile rank of 90% for the SAT scores. That is, only 10% of the scores are higher than his. What is the predicted percentile rank for the 1st-year GPA of this student? Using the normal curve, a 90% probability corresponds to a z score of about 1.3. This means that the student's SAT score is 1.3 SDs above average. This corresponds to being 0.4 × 1.3 ≈ 0.5 SDs above the average GPA, which corresponds to an accumulated probability, under the normal curve, of approximately 69%.
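
The percentile-rank calculation can be checked numerically. The sketch below is an illustration, assuming both variables follow the normal curve. It uses `NormalDist` from Python's standard `statistics` module, and the function name `predicted_percentile` is hypothetical.

```python
from statistics import NormalDist


def predicted_percentile(x_percentile, r):
    """Predict the percentile rank of y from the percentile rank of x."""
    std_normal = NormalDist()                      # mean 0, SD 1
    z_x = std_normal.inv_cdf(x_percentile / 100)   # x in standard units
    z_y = r * z_x                                  # regression method
    return 100 * std_normal.cdf(z_y)               # back to a percentile


gpa_rank = predicted_percentile(90, 0.40)
print(f"predicted GPA percentile rank: {gpa_rank:.1f}%")
```

Without rounding the z scores the result is a little under 70%, in line with the 69% obtained by hand. Calling the function with r = 1 returns the original rank, and with r = 0 it returns 50.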

So the percentile rank on 1st-year GPA of a student with a percentile rank on SAT score of 90% is predicted to be 69%. Notice that the student with an SAT percentile rank of 90% was 'pulled down' to only 69% by the regression method. Why is that? Suppose the correlation were perfect, r = 1; then 90% would convert to 90%. At the other extreme there is no correlation, r = 0, so, in the absence of any information, the best guess is the median, or 50th percentile. The regression method produces a rank that is somewhere between these two extremes.

Example

The shoe sizes and the heights of 14 men are recorded. The average shoe size is 10.46 with an SD of 1.21. The average height is 70.45 inches with an SD of 2.45 inches. The correlation is 0.93. What is the average height of men who wear shoes of size 11.5?

We convert 11.5 to standard units

(11.5 − 10.46) / 1.21 ≈ 0.86

so the shoe size is 0.86 SDs above average. This means that the height will be 0.86 × 0.93 × 2.45 ≈ 1.96 inches above average. So the average height of a man with shoe size 11.5 will be 70.45 + 1.96 = 72.41 inches.

Regression effect

Galton, a British statistician, studied the relationship between the heights of fathers and sons in 1,078 families. He noticed that tall fathers tended to have sons who were shorter than themselves, and short fathers tended to have sons who were taller than themselves. He termed this fact regression to mediocrity. This is where the term regression comes from.

Example: Children are tested for IQ before and after taking a preschool program. In both cases the scores average 100 and the SD is 15. So, on average, there seems to be no effect. Nevertheless, children below average in the first test had an average gain of 5 IQ points and those above average had an average loss of 5 IQ points. This is the regression effect.

A model for the test-retest situation is

observed test score = true score + chance error

Suppose that the chance error can be either positive or negative.
Suppose that the true scores in the population follow the normal curve with an average of 100 and an SD of 15. Consider the children who scored 140 on the first test. There are two possibilities:

• true score below 140, with a positive chance error
• true score above 140, with a negative chance error

Which one is more likely? According to the normal curve, the first possibility is more likely: since the mean is 100, true scores below 140 are far more common than true scores above 140. Under this scenario, the second test is more likely to produce a value below 140.

A symmetric situation holds for those scoring, say, 80 IQ. It is likely that the true score is above 80 with a negative chance error, and so the second score is likely to be above the first.

In other words, if a student scores above average in the first test, it is likely that the true score is lower than the observed one. If the student takes the test again, chances are that the second score will be lower than the first. A symmetric situation is true for a person scoring below average in the first test. This explains the regression effect.

Regression errors

The regression method can be used to predict y from x. But actual values differ from predictions. These differences are the regression errors.

error = actual value of y − predicted value of y

Some of the errors defined in this way are positive and some are negative, reflecting the fact that some observations are above and some are below the regression line.

How do we measure the error in a regression? The overall size of the error is measured using the root-mean-square (RMS), as we did to obtain the SD. This is equal to

RMS error = √( (sum of the squared errors) / N )

where N is the number of points in the scatter diagram.

What if we ignore the values of x? Then our prediction for y is the average of y. In this case the RMS error coincides with the SD of y.
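
The last statement can be verified on a small data set. The values of `y` below are made up purely for illustration and the helper name `rms` is an assumption. The SD is computed with `pstdev`, which divides by N as in the RMS definition.

```python
from math import sqrt
from statistics import mean, pstdev


def rms(values):
    """Root-mean-square: square, average, then take the square root."""
    return sqrt(mean([v * v for v in values]))


# Made-up y values, for illustration only.
y = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

# Ignoring x, the prediction for every point is the average of y.
prediction = mean(y)
errors = [actual - prediction for actual in y]

print("RMS error:", rms(errors))
print("SD of y:  ", pstdev(y))
```

Both printed values agree, because the SD is by definition the RMS of the deviations from the average.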

Computing the RMS error

We saw that a prediction that ignores the values of x has an error equal to the SD of y. The overall size of the error for a regression using x is smaller than the SD of y whenever r is not zero. How much smaller?

RMS error = √(1 − r²) × SD of y

We observe the following features:

• The units of the RMS error are the same as the units of the variable being predicted.
• Perfect correlation (r = 1 or r = −1) corresponds to zero RMS error.
• Zero correlation corresponds to maximum RMS error (equal to the SD of y).

Example 1: In the California temperature example the SD of y is 6.5 degrees and the correlation is −0.76, so

√(1 − 0.76²) × 6.5 degrees ≈ 4.22 degrees

So, in this case, knowing the altitude reduces the typical size of the prediction error from 6.5 to 4.22 degrees.

Example 2: In the shoe size example the SD of y is 2.45 inches and the correlation is 0.93, so

√(1 − 0.93²) × 2.45 inches ≈ 0.90 inches

So we observe that knowing the shoe size produces a dramatic reduction of the error, from 2.45 to 0.90 inches.

Plotting the residuals

Prediction errors are usually called residuals. It is important to explore the graphical properties of the residuals to assess how well the regression line fits the data. In a residual plot the x coordinates are the same as for the original data. The y coordinates correspond to the values of the residuals. So there is one point for each point in the original scatter diagram.

Thus, if everything is OK with the regression line, we expect to see a cloud of points around the zero line in the y axis.
• We expect to see no trends or clusters in the residuals
• There should be about the same number of positive as negative residuals
• A histogram of the residuals should look symmetric around zero

Problem

The following results are taken from a study of about 1,000 families:

average height of husband = 68 inches, SD ≈ 2.7 inches
average height of wife = 63 inches, SD ≈ 2.5 inches
r ≈ 0.25

Predict the height of a wife when the height of her husband is

1. 72 inches. The husband is 4 inches above average height. This is 4/2.7 ≈ 1.5 SDs above the average. So the wife is predicted to be r × 1.5 = 0.25 × 1.5 ≈ 0.4 SDs above average, which corresponds to 0.4 × 2.5 = 1 inch. The predicted height is 63 + 1 = 64 inches.

2. 68 inches. The husband is right on the average, so the wife is predicted to be right on the average as well, at 63 inches.

Prediction for data in a vertical strip

Example: A law school finds the following relationship between LSAT scores and first-year scores:

average LSAT score = 162, SD = 6
average first-year score = 68, SD = 10
r = 0.60

Q: About what percentage of the students had first-year scores over 75?

A: We use the normal curve approximation. Converting to standard units

(75 − 68) / 10 = 0.7

which corresponds to a right-hand tail of about 24% under the normal curve.

Q: Of the students who scored 165 on the LSAT, about what percentage had first-year scores over 75?

A: We first convert to standard units for the x variable:

(165 − 162) / 6 = 0.5

then obtain the predicted value of the y variable in standard units

r × 0.5 = 0.6 × 0.5 = 0.3

which corresponds to 0.3 × 10 = 3 points above average, or 68 + 3 = 71.

Since the data corresponding to a strip are a smaller and more homogeneous sample, the corresponding SD will be smaller. How much smaller?

We expect the dispersion in the y variable to be about the same for each vertical strip. This is given by the RMS error, thus the new SD is

√(1 − r²) × SD of y = √(1 − 0.6²) × 10 = 8 points

This new SD can be used to convert to standard units

(75 − 71) / 8 = 0.5

and, using the normal curve, we obtain an area of 31% above 0.5. This is the percentage of students scoring more than 75 in the first year among those who scored 165 on the LSAT. Notice that this percentage is higher than the 24% we obtained for all students. This is because students who scored 165 on the LSAT are above average, so their first-year scores are centred at 71 rather than 68, even though the SD within the strip is smaller (8 instead of 10).

In summary, when considering data for a vertical strip:

• Convert to standard units in the x variable.
• Obtain the predicted value of the y variable.
• Calculate the SD for the y variable in the strip using the RMS error.
• Convert to standard units in the y variable and use the normal curve.

Slope and intercept

All lines can be determined by a slope and an intercept. The intercept is the height of the line when x = 0. The slope is the rate at which y increases per unit increase in x. If the slope is negative then y decreases as x increases.

How do you get the slope of a regression line?

Example: A sample of 555 California men aged 25-29 in 1993 was surveyed to find out about education and income. The data are summarized by

average education ≈ 12.5 years, SD ≈ 4 years
average income ≈ $21,500, SD ≈ $16,000
r ≈ 0.35

This means that, for every increase of one SD in education, there is an increase of r SDs in income. Thus, 4 extra years of education are worth an extra 0.35 × $16,000 = $5,600 of income. So each extra year is worth

(0.35 × $16,000) / 4 years = $1,400 per year

This is the slope of the regression line.
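
The slope calculation is short enough to write as a one-line function. This is a minimal sketch; the name `regression_slope` is an assumption, and the inputs are the education and income summaries quoted above.

```python
def regression_slope(sd_x, sd_y, r):
    """Change in the average of y per one-unit increase in x."""
    return r * sd_y / sd_x


# Education (years) and income (dollars) for the 555 California men.
slope = regression_slope(sd_x=4.0, sd_y=16000.0, r=0.35)
print(f"each extra year of education is worth about ${slope:,.0f}")
```

Note that only the two SDs and r enter the slope; the two averages are needed for the intercept, not for the slope.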

The intercept of the regression line is given by the value of y when x = 0. In the example, x = 0 is 12.5 years below average in education. Since each year is worth $1,400, a man with no education should have an income which is below average by

12.5 years × $1,400 per year = $17,500

Since the average income is $21,500, the predicted income of a man with no education is $21,500 − $17,500 = $4,000. This is the intercept of the regression line.

In general, the slope of the regression line is given by

(r × SD of y) / (SD of x)

This corresponds to the change in y associated with a one-unit increase in x. The intercept is given by

average of y − slope × average of x

The equation for the regression line is called the regression equation and can be written as

y = slope × x + intercept

So, for our example, we have that

predicted income = $1,400 per year × education + $4,000

Q: What is the predicted income of a man with an education of 15 years?

A: Using the regression equation we have

y = $1,400 × 15 + $4,000 = $25,000

We can plug in any value of education and obtain the expected income for that level of education.

Warning: It is usually a bad idea to use the regression line for extrapolations.

Example

Back to our shoe size example. The shoe sizes and the heights of 14 men are recorded. The average shoe size is 10.46 with an SD of 1.21. The average height is 70.45 inches with an SD of 2.45 inches. The correlation is 0.93.

The slope of the regression line is

(r × SD of height) / (SD of shoe size) = (0.93 × 2.45) / 1.21 ≈ 1.88

To obtain the intercept we consider a shoe size of zero. This is 10.46 units below average and so corresponds to a height that is 1.883 × 10.46 ≈ 19.70 inches below average (using the slope before rounding). So it corresponds to a height of 70.45 − 19.70 = 50.75 inches. The regression line is

height = 1.88 × shoe size + 50.75 inches

Q: What is the predicted height of a man with a shoe size of 9?
A: Using the regression equation we have

1.88 × 9 + 50.75 inches = 67.67 inches

Least Squares

Consider a cloud of points produced by obtaining the scatter diagram of observations corresponding to two variables x and y. There are many lines that we can draw through the cloud. Which is the straight line that fits the points best? The regression line is a possible solution to this problem: among all lines, it has the smallest RMS error in predicting y from x. This is the reason why the regression line is called the least squares line.

Example: Let b be the length of a spring with no load. If a load x is attached to the spring, the stretch is proportional to x. Thus the length of the spring is

y = mx + b

where m and b are constants that depend on the spring. An experiment is run to determine the constants for a given spring; the data are shown in the table. The correlation coefficient is r = 0.999, so the points are very close to a straight line. But they are not exactly on a straight line. This is probably due to measurement error. The regression line for these data produces estimates of b and m, given, respectively, by the intercept and the slope of the line. The values are m ≈ 0.5 cm per kg and b ≈ 439.01 cm. These are the least squares estimates of m and b.

Problem

Find the regression equation for predicting final score from midterm score, based on the following information:

average midterm score = 70, SD = 10
average final score = 55, SD = 20
r = 0.60

The slope of the line can be obtained as

(r × SD of final) / (SD of midterm) = (0.60 × 20) / 10 = 1.2

A score of 0 in the midterm will correspond to a final score that is 1.2 × 70 = 84 units below average. So the intercept is 55 − 84 = −29 units of the final score. Thus, the regression equation is

final score = 1.2 × midterm score − 29
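
The steps of this problem can be collected into a small helper that builds the regression equation from the five summary numbers. This is a sketch for checking hand calculations; the function name `regression_equation` and the midterm score of 80 used in the last step are illustrative assumptions.

```python
def regression_equation(avg_x, sd_x, avg_y, sd_y, r):
    """Return (slope, intercept) of the regression line for y on x."""
    slope = r * sd_y / sd_x                # change in y per unit of x
    intercept = avg_y - slope * avg_x      # predicted y when x = 0
    return slope, intercept


# Midterm (x) and final (y) summaries from the problem above.
slope, intercept = regression_equation(70.0, 10.0, 55.0, 20.0, 0.60)
print(f"final score = {slope:.1f} x midterm score + ({intercept:.0f})")

# Hypothetical student with a midterm score of 80.
predicted_final = slope * 80 + intercept
print(f"predicted final for a midterm of 80: {predicted_final:.0f}")
```

The same function reproduces the income example when given the education and income summaries. As the earlier warning says, the equation should only be trusted for x values inside the range of the data.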