DISPLAYING THE RELATIONSHIP

DEFINITIONS: Studies are often conducted to attempt to show that some explanatory variable "causes" the values of some response variable to occur. The response or dependent variable is the variable of interest, the one we want to predict, and is usually denoted by y. The explanatory or independent variable attempts to explain the response and is usually denoted by x.

A scatterplot shows the relationship between two quantitative variables x and y. The values of the x variable are marked on the horizontal axis, and the values of the y variable are marked on the vertical axis. Each pair of observations $(x_i, y_i)$ is represented as a point in the plot.

Two variables are said to be positively associated if, as x increases, the values of y tend to increase. Two variables are said to be negatively associated if, as x increases, the values of y tend to decrease. When a scatterplot does not show a particular direction, neither positive nor negative, we say that there is no linear association.

[Figure: Scatterplot of Final vs Midterm Scores (horizontal axis: Midterm; vertical axis: Final). The 10th student is the point (21, 38).]

Student Number   X Midterm Score   Y Final Score
1                39                62
2                44                69
3                32                68
4                40                86
5                45                88.5
6                46                88.5
7                33                76
8                39                66.5
9                32.5              75
10               21                38
11               30                71
12               39                88
13               44                96.5
14               28.5              71.5
15               38                96
16               43                82.5
17               42                85
18               25.5              28
19               47                95
20               36                39
21               31.5              58
22               32                49
23               42                62
24               21                59
25               41                90

Let's Do It! 1

The data below were obtained in a study of age and systolic blood pressure of six randomly selected subjects. Make a scatterplot to examine the relationship between x = age and y = pressure. Comment on the relationship with respect to form, direction, strength, and any departures or unusual values.

Subject       A     B     C     D     E     F
Age x         43    48    56    61    67    70
Pressure y    128   120   135   143   141   152

Notes of Caution

1. An observed relationship between two variables does not imply that there is a causal link between the two variables.
For example, consider the following scatterplot of IQ score versus shoe size:

[Figure: scatterplot of IQ (vertical axis) versus Shoe Size (horizontal axis).]

As a person ages, their shoe size increases, as does their IQ. Although there is a positive association, there is no causal link between the two variables shoe size and IQ. Most studies attempt to show that some explanatory variable "causes" the values of the response to occur. While we can never positively determine whether or not there is a distinct cause-and-effect relationship, we can assess whether there appears to be such a relationship.

2. A relationship between two variables can be influenced by confounding variables. Consider the following scatterplot of the number of sports magazines read in a month versus the height of the person:

[Figure: scatterplot of Number of magazines read (vertical axis) versus Height (horizontal axis), with separate plotting symbols for women and men.]

Overall there appears to be a positive association between height and number of magazines. However, within each gender there does not appear to be an association. Gender is a confounding variable, and aggregating the data across gender can result in misleading conclusions. Any study, especially an observational study, has the potential to be wrongly interpreted because of confounding variables.

3. Unusual data points (outliers) can distort the apparent association, especially if the data set is small. Consider the following scatterplot of the percentage of people who speak English versus population size.

[Figure: scatterplot of Percent who speak English (vertical axis) versus Population Size (horizontal axis), with one point marked as an outlier.]

The eight points in the scatterplot represent eight countries from Central and South America selected at random. The outlier is Mexico City.

4. Sometimes a scatterplot, such as the one in the figure below, shows a curvilinear relationship between the variables. Methods for curvilinear relationships are beyond the scope of this course.

Simple Linear Regression

[Figure: Scatterplot of Final vs Midterm Scores with two candidate lines drawn through the points, Line #1 and Line #2.]

So the question remains: how do we find a "best-fitting" line?
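One way to make "best-fitting" concrete is to score each candidate line by the sum of its squared vertical deviations from the data points; a smaller total means the line sits closer to the points. The sketch below does this for the 25 midterm and final scores listed earlier. The two candidate lines are illustrative choices only (they are not claimed to be Line #1 and Line #2 from the figure): a flat line at 71.5, and the line with intercept 7.5 and slope 1.75 that these notes report later.

```python
# Midterm (x) and final (y) scores for the 25 students listed earlier in these notes.
midterm = [39, 44, 32, 40, 45, 46, 33, 39, 32.5, 21, 30, 39, 44, 28.5,
           38, 43, 42, 25.5, 47, 36, 31.5, 32, 42, 21, 41]
final = [62, 69, 68, 86, 88.5, 88.5, 76, 66.5, 75, 38, 71, 88, 96.5, 71.5,
         96, 82.5, 85, 28, 95, 39, 58, 49, 62, 59, 90]


def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Candidate A (hypothetical): a flat line at 71.5 that ignores the midterm score.
sse_flat = sum_squared_deviations(71.5, 0.0, midterm, final)

# Candidate B: intercept 7.5 and slope 1.75, the line reported later in the notes.
sse_sloped = sum_squared_deviations(7.5, 1.75, midterm, final)

print(f"flat line:   {sse_flat:.1f}")
print(f"sloped line: {sse_sloped:.1f}")
```

The candidate with the smaller total is the better fit by this criterion; trying other intercepts and slopes in the same function shows how the total changes from line to line.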
Equation of a Line

$y = a + bx$

where
b = slope: the amount y changes when x is increased by 1 unit.
a = y-intercept: the value of y when x is set equal to zero.

DEFINITION: The least squares regression line, given by $\hat{y} = a + bx$, is the line that makes the sum of the squared vertical deviations of the data points from the line as small as possible. Performing the regression is often stated as "regress y on x".

The least squares regression line for regressing final exam scores, y, on midterm exam scores, x, is given by $\hat{y} = 7.5 + 1.75x$.

The estimated slope of b = 1.75 tells us that for a 1-point increase on the midterm we would expect, on average, an increase of 1.75 points on the final exam.

The estimated y-intercept of a = 7.5 tells us that if someone were to score 0 points on the midterm, we would predict they would get 7.5 points on the final exam.

Suppose a new student scores 40 points on the midterm. Based on our model, what would be their predicted final exam score? Plug the value x = 40 into our estimated equation. The predicted final exam score is $\hat{y} = 7.5 + 1.75(40) = 77.5$ points.

Let's Do It! 2 13.2 Childhood Growth

The growth of children from early childhood through adolescence generally follows a linear pattern. Data on the heights of female Americans during childhood, from four to nine years old, were compiled and the least squares regression line was obtained as $\hat{y} = 80 + 6x$, where y is height in centimeters and x is age in years. Note that 1 inch is equal to 2.54 centimeters.
(a) Interpret the value of the estimated slope b = 6.
(b) Would interpretation of the value of the estimated y-intercept, a = 80, make sense here? If yes, interpret it. If no, explain why not.
(c) What would you predict the height to be for a female American at 8 years old? Give your answer first in centimeters, then in inches.
(d) What would you predict the height to be for a female American at 25 years old? Give your answer first in centimeters, then in inches.
(e) Why do you think your answer to part (d) was so inaccurate?

Calculating the Least Squares Regression Line

The Least Squares Regression Line: The least squares regression line is given by $\hat{y} = a + bx$, where

slope: $b = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}$

y-intercept: $a = \bar{y} - b\bar{x}$

Example: Test 1 versus Test 2, Obtaining the Regression Line "By Hand"

(a) Look at the relationship graphically with a scatterplot to confirm initially that a linear model seems appropriate.
(b) Calculate the estimated regression line by completing the calculation table shown below.

$b = \dfrac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2} = \dfrac{5(884) - (60)(70)}{5(760) - (60)^2} = \dfrac{220}{200} = 1.1$

$a = \bar{y} - b\bar{x} = \dfrac{70}{5} - 1.1\left(\dfrac{60}{5}\right) = 0.8$

Least squares equation: $\hat{y} = 0.8 + 1.1x$

The slope of the line is b = 1.1. This means that Test 2 scores are expected to go up by 1.1 points on average for each additional point scored on Test 1. A student who scored 15 points on Test 1 is predicted to score $\hat{y} = 0.8 + 1.1(15) = 17.3$ points on Test 2.

Test 1 versus Test 2, Obtaining the Regression Line Using the TI Calculator

To obtain the least squares regression line using the TI graphing calculator we would first need to enter the data.

L1 (Test 1)    8    10    12    14    16
L2 (Test 2)    9    13    14    15    19

Enter the values of the quantitative variable x = Test 1 into L1 and enter the corresponding values of the quantitative variable y = Test 2 into L2. To get the least squares regression equation we use the following sequence of buttons.

Your output screen should provide the least squares regression equation as y = a + bx with the y-intercept a = 0.8 and the slope b = 1.1.

Caution: There are two linear regression options, namely LinReg(ax+b) and LinReg(a+bx). We request the latter option, which uses b to represent the slope.

Let's Do It! 13.3 Oil-Change Data

The table below presents data on x = the number of oil changes per year and y = the cost of repairs for a random sample of 10 cars of a certain make and model, from a given region.
(a) Make a scatterplot of the points as a check for linearity and outliers.
Comment on your plot.
(b) Find the least squares regression line for regressing cost on number of oil changes. Describe what the estimated y-intercept and estimated slope represent.
(c) Use your least squares regression line to predict the cost of car repairs for a car that had four oil changes.

Homework 11.3, page 546: 1, 13, 21, 23, 30, 32, 38, 60, 62, 65, 93, 97

11.4 STATISTICALLY SIGNIFICANT RELATIONSHIP?

Researchers must rely on data from only a sample in order to assess whether a relationship exists between two variables. Even though a relationship may be apparent in the sample, it is possible that it will not extend to the population. Researchers use statistics to assess the significance of an observed relationship by measuring the chance that a relationship as strong or stronger would be observed, assuming there really is no relationship in the population.

Think About It: Slope of Zero

Consider the equation of a line for relating two variables: $\hat{y} = a + bx$. Suppose the y-intercept, a, is equal to 10 and the slope, b, is equal to 0.
What would be the value of the response, $\hat{y}$, if x were equal to 2?
What would be the value of the response, $\hat{y}$, if x were equal to 12?
What would be the value of the response, $\hat{y}$, for any value of x?
What would it mean if the slope for the regression line for a population were equal to 0?

The hypothesis of interest in linear regression is:

Main Hypothesis: The slope of the linear regression line using all of the population values is equal to 0 (i.e., the linear relationship is not significant).
Alternative Hypothesis: The slope of the linear regression line using all of the population values is NOT 0 (i.e., the linear relationship is significant).

After obtaining the regression line, one should test the significance of the components of the line (mainly, the slope b). We should remember that we are testing for a linear relationship between x and y. It is possible that x may determine y nonlinearly.
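The last point, that x may determine y nonlinearly even when the fitted slope is zero, can be seen in a tiny made-up data set. In the sketch below the five points are hypothetical and chosen so that y is exactly x squared; the helper `least_squares` is an illustrative name that simply applies the summation formulas for b and a given earlier in these notes.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x using the summation formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data: y is completely determined by x, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

a, b = least_squares(xs, ys)
print("intercept a =", a)
print("slope b =", b)
```

Here the fitted slope is 0 even though knowing x tells us y exactly, so a slope of zero rules out a linear relationship, not every relationship.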
The test of the main hypothesis above is called the Slope F-Test of Significance.

Microsoft EXCEL output: Regression of final exam score, y, on midterm exam score, x.

Lines 12 through 14 are used to assess the significance of the respective coefficients. Line 14 presents information about the slope. The p-value in Line 14 of 0.0001 is used to assess whether the estimated slope is statistically significantly different from zero. Line 13 presents information about the y-intercept. The p-value in Line 13 is used to test whether the y-intercept for the population linear regression is equal to 0.

Example: Service Time

A computer-repair technician recorded data on the number of computers serviced and the amount of time to complete the service for 11 randomly selected service visits.

[Figure: scatterplot of Service Time (vertical axis, 0 to 200) versus Number of Computers Serviced (horizontal axis, 1 to 8).]

x = number of computers serviced and y = time to complete the service

(a) What type of association, if any, does the scatterplot show?

Output from SPSS:

(b) Obtain the estimated linear regression equation: $\hat{y} = a + bx$
Note: The coefficient column of the output contains the values for b and a. SPSS reports the estimated slope first, in the * row, and the y-intercept information in the ** row.

$\hat{y} = 10.19 + 24.83x$

(c) Is there evidence of a significant (non-zero) linear relationship between the number of computers serviced and the service time? Explain.
Note: The p-value is nearly 0 (p-value < 0.00005), so the number of computers serviced appears to be a significant linear predictor of service time.

(d) Predict the number of minutes required for service when it is reported that 5 PCs are down.

$\hat{y} = 10.19 + 24.83(5) \approx 134.3$ minutes

Let's Do It! Size of Homes and Selling Price

There are many factors that affect the selling price of a home. The total dwelling size and the assessed value are just two factors. Data were gathered on homes in a Milwaukee, Wisconsin, neighborhood.
Scatterplots revealed a linear relationship between the total dwelling size of a home in 100 square feet and its selling price in dollars. Below is the regression output for the least squares regression of selling price on total dwelling size.

Dep var: PRICE   N: 20   Multiple R: .913   Squared multiple R: .834
Adjusted squared multiple R: .825   Standard error of estimate: 3377.192

Variable    Coefficient   Std error   Std coef   Tolerance   T       P(2 tail)
CONSTANT    11947.010     4748.133    0.000      .           2.516   0.022
SIZE        2749.622      288.980     0.913      .100E+01    9.515   0.000

Analysis of Variance
Source       Sum-of-squares   DF   Mean-square    F-ratio   P
Regression   .103257E+10      1    .103257E+10    90.533    0.000
Residual     .205298E+09      18   .114054E+08

(a) How many homes were included in this study?
(b) Obtain the least squares regression line for predicting selling price from the size of the home.
(c) Is there a significant (nonzero) linear relationship between price and size? Explain.
(d) The total dwelling size for another home in this neighborhood is 1620 square feet. Use the least squares regression equation to estimate the selling price of this home.

11.7 CORRELATION: HOW STRONG IS THE LINEAR RELATIONSHIP?

DEFINITION: The sample correlation coefficient r measures the strength of the linear relationship between two quantitative variables. It describes the direction of the linear association and indicates how close the points in a scatterplot are to the least squares regression line.

Features of the correlation coefficient:

1. Range: $-1 \le r \le 1$.
2. Sign: The sign of the correlation coefficient indicates the direction of association: negative [-1, 0) or positive (0, +1].
3. Magnitude: The magnitude of the correlation coefficient indicates the strength of the linear association. If the data follow a straight line, then $r = 1$ (if the slope is positive) or $r = -1$ (if the slope is negative), indicating a perfect linear association. If $r = 0$ then there is no linear association.
4. Measures Strength: The correlation only measures the strength of the linear association.
5. Unit-less: The correlation is computed using standard scores of the two variables. It has no unit of measure, and the absolute value of r will not change if the units of measurement for x or y are changed. The correlation between x and y is the same as the correlation between y and x.

Some Pictures....

[Figure: three scatterplots of y versus x. First: positive, moderate to strong linear association, r = 0.8. Second: negative, weak linear association, r = -0.2. Third: a strong association, just not a linear one, r = 0.]

Let's Do It! 1 13.8 Matching Graphs

Scatterplot #1 to the right yields a regression line of $\hat{y} = -2.6 + 1.1x$ and a correlation of r = 0.84. Using this information as a base, match each of the four scatterplots below to the correct description of its regression line and correlation coefficient. The scales on the axes of the scatterplots are the same.

How to Calculate the Correlation Coefficient r

Dep var: PERCENT   N: 18   Multiple R: .840   Squared multiple R: .706
Adjusted squared multiple R: .688   Standard error of estimate: 10.547

Variable    Coefficient   Std error   Std coef   Tolerance   T        P(2 tail)
CONSTANT    96.681        6.289       0.000      .           15.373   0.000
LENGTH      -5.970        0.963       -0.840     .100E+01    -6.201   0.000

Analysis of Variance
Source       Sum-of-squares   DF   Mean-square   F-ratio   P
Regression   4276.908         1    4276.908      38.448    0.00
Residual     1779.815         16   111.238

"Multiple R: 0.840" = absolute value of the correlation coefficient r. The sign of r can be determined by looking at the sign of the slope, which here is -5.970. The correlation between length of putt and percentage of putts made is r = -0.84.

The formula:

$r = \dfrac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$

Example: Test 1 versus Test 2, Obtaining the Correlation Coefficient "By Hand"

We already have computed the summation quantities needed for finding r, shown in the calculation table.
Completed Calculation Table

          x_i    y_i    x_i^2    x_i y_i    y_i^2
          8      9      64       72         81
          10     13     100      130        169
          12     14     144      168        196
          14     15     196      210        225
          16     19     256      304        361
Total:    60     70     760      884        1032

$r = \dfrac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} = \dfrac{5(884) - (60)(70)}{\sqrt{5(760) - (60)^2}\,\sqrt{5(1032) - (70)^2}} = 0.965$

The large positive correlation coefficient and the scatterplot indicate a strong, positive, linear association between Test 1 and Test 2 scores.

Let's Do It! 2 Birth Rates

We gathered data from 1970 for twelve nations on the percentage of women aged 14 or older who were economically active and the crude birth rate. (We define the crude birth rate as the number of births in a year per 1000 population.) We are interested in the relationship between the crude birth rate (y) and the percentage of women who were economically active (x).

a. Create the scatterplot. Determine if there is a positive, negative, or no association between x and y.

Nation          x     y
Algeria         2     48
Argentina       19    21
Denmark         34    14
E. Germany      40    11
Guatemala       8     41
India           12    37
Ireland         20    22
Jamaica         20    31
Japan           37    19
Philippines     19    42
USA             30    15
Soviet Union    46    18

b. Find the equation of the regression line. Interpret the slope.
c. Find the correlation coefficient r.

Obtaining the Correlation Using the TI

To get the regression line and the correlation coefficient using the TI we first need to turn on the diagnostic option. If the x data is in L1 and the y data is in L2, then the steps are as follows:

Let's Do It! 3 Birth Rates

Check the value of r you obtained in activity 2 above using the TI calculator.

Let's Do It! 4 Data on Milk Production

Milk samples were obtained from 14 Holstein-Friesian cows, and each was analyzed to determine uric acid concentration (Y), measured in mol/L. In addition to acid concentration, the total milk production (X), measured in kg/day, was recorded for each cow. The data were entered into a computer and the following regression output was obtained.
(a) What is the equation of the least squares regression line?
(b) What is the correlation between x and y? r = __________.
(c) We are interested in whether the linear relationship is significant, which we assess by testing the following hypothesis.
Main Hypothesis: The slope of the regression line equals 0.
Circle the p-value in the output that is used to test this hypothesis. At the 0.05 level of significance, we would (circle one): Accept H0   Reject H0   Can't Tell

THE SQUARED CORRELATION $r^2$: WHAT DOES IT TELL US?

r = correlation coefficient; it gives the strength and the direction of the linear relationship between two quantitative variables x and y, with $-1 \le r \le 1$.

Note that when we square r we get $0 \le r^2 \le 1$.

The value of $r^2$ is the proportion of the variation in the dependent variable y that is explained by the independent variable x. The quantity $r^2$ is generally denoted in computer output as $R^2$, and is often reported as a percent.

$r^2 = 0.75$ means that about 75% of the variation in the response variable y can be explained by the linear relationship between x and y.

Homework page 546: 2, 3, 4, 14, 15, 22, 36, 37, 39
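The by-hand calculations in the Test 1 versus Test 2 example (slope, intercept, prediction, and correlation) and the squared correlation can be checked with a short script. This is a minimal sketch that applies the summation formulas from these notes directly; the variable names are illustrative, and only the Python standard library is assumed.

```python
import math

# Test 1 (x) and Test 2 (y) scores from the worked example in these notes.
test1 = [8, 10, 12, 14, 16]
test2 = [9, 13, 14, 15, 19]

n = len(test1)
sum_x = sum(test1)                                  # 60
sum_y = sum(test2)                                  # 70
sum_x2 = sum(x * x for x in test1)                  # 760
sum_y2 = sum(y * y for y in test2)                  # 1032
sum_xy = sum(x * y for x, y in zip(test1, test2))   # 884

# Slope and intercept of the least squares line y-hat = a + b*x.
numerator = n * sum_xy - sum_x * sum_y              # 220
b = numerator / (n * sum_x2 - sum_x ** 2)           # 220 / 200
a = sum_y / n - b * sum_x / n

# Correlation coefficient r and squared correlation r^2.
r = numerator / math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_squared = r ** 2

# Predicted Test 2 score for a student who scored 15 on Test 1.
predicted = a + b * 15

print(f"slope b = {b:.2f}, intercept a = {a:.2f}")
print(f"predicted Test 2 score at x = 15: {predicted:.1f}")
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Read as a percent, the printed $r^2$ is the share of the variation in Test 2 scores that is explained by the linear relationship with Test 1 scores.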