CHAPTER 10 Correlation and Regression

Objectives
• Draw a scatter plot for a set of ordered pairs.
• Compute the correlation coefficient.
• Test the hypothesis H0: ρ = 0.
• Compute the equation of the regression line.
• Compute the coefficient of determination.

Section 10.1 Introduction
• In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship exists between two or more numerical (quantitative) variables.

Statistical Methods
• Correlation is a statistical method used to determine whether a relationship between variables exists.
• Regression is a statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.

Statistical Questions
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?

Section 10.2 Correlation

I. Scatter Plots
• A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable, x, and the dependent variable, y.
• A scatter plot is a visual way to describe the nature of the relationship between the independent and dependent variables.

Example: Construct a scatter plot for the data obtained in a study of age and blood pressure.

Age, x:        43   48   56   61   67   70
Pressure, y:  128  120  135  143  141  152

II. Correlation Coefficient
• A correlation coefficient is a measure of the strength of the relationship between two variables.
• In a simple relationship, there are only two variables under study.
• In multiple relationships, many variables are under study.

Correlation Coefficient
• The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables.
• The symbol for the sample correlation coefficient is r.
• The symbol for the population correlation coefficient is ρ (Greek letter rho).
• The range of the correlation coefficient is from −1 to +1.
• If there is a strong positive linear relationship between the variables, the value of r will be close to +1.
• If there is a strong negative linear relationship between the variables, the value of r will be close to −1.
• When there is no linear relationship between the variables or only a weak relationship, the value of r will be close to 0.

[Figure: Range of Values for the Correlation Coefficient]

In general, if |r| > 0.7, there is a positive (r > 0.7) or negative (r < −0.7) linear correlation between x and y.

[Figure: Relationship Between the Correlation Coefficient and the Scatter Plot]

Formula for the Correlation Coefficient r

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}$$

where n is the number of data pairs.

Example: Compute the value of the correlation coefficient for the data obtained in the study of age and blood pressure.

Age, x:        43   48   56   61   67   70
Pressure, y:  128  120  135  143  141  152

Compute the columns xy, x², and y², sum each column, and substitute the sums into the formula for r.

Possible Relationships Between Variables
• There is a direct cause-and-effect relationship between the variables: that is, x causes y.
• There is a reverse cause-and-effect relationship between the variables: that is, y causes x.
• The relationship between the variables may be caused by a third variable: that is, y may appear to cause x, but in reality z causes x.
• There may be a complexity of interrelationships among many variables: that is, x may cause y, but w, t, and z fit into the picture as well.
• The relationship may be coincidental: although a researcher may find a relationship between x and y, common sense may prove otherwise.

Interpretation of Relationships
• When the null hypothesis is rejected, the researcher must consider all possibilities and select the appropriate relationship between the variables as determined by the study. Remember, correlation does not necessarily imply causation.
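The formula for r can be checked numerically. The following is a minimal Python sketch, not part of the original notes, that applies the formula to the age and blood pressure data; the function and variable names are illustrative assumptions.

```python
import math

# Age (x) and blood pressure (y) from the example in this section.
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]


def correlation_coefficient(xs, ys):
    """Return the sample correlation coefficient r for paired data."""
    n = len(xs)  # number of data pairs
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = correlation_coefficient(ages, pressures)
print(round(r, 3))  # 0.897
```

The result is close to +1, which suggests a strong positive linear relationship between age and blood pressure in this sample.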
Population Correlation Coefficient
Formally defined, the population correlation coefficient ρ is the correlation computed by using all possible pairs of data values (x, y) taken from a population.

Hypothesis Testing
• In hypothesis testing, one of the following is true:
  H0: ρ = 0 This null hypothesis means that there is no correlation between the x and y variables in the population.
  H1: ρ ≠ 0 This alternative hypothesis means that there is a significant correlation between the variables in the population.

Formula for the t Test for the Correlation Coefficient
• Formula for the t test for the correlation coefficient:

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

  with degrees of freedom equal to n − 2.
• Example: Test the significance of the correlation coefficient found for the data obtained in the study of age and blood pressure. Use α = 0.05 and r = 0.897.

Testing the Significance of r Using Table I
• Table I shows the values of the correlation coefficient that are significant for a specific α level and a specific number of degrees of freedom.
• Any value of r greater than a positive critical value or less than a negative critical value will be significant, and the null hypothesis will be rejected.

Example: Using Table I, test the significance of the correlation coefficient r = 0.0667 at significance level α, with a sample size of 9.

Section 10.3 Regression
• Regression is a statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.

Types of Regression and Correlation
[Figure panels: exponential regression; perfect correlation; positive correlation (as x increases, y increases); negative correlation (as x increases, y decreases); no correlation]

Linear Regression
• If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data's line of best fit.
• Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.
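The idea of best fit can be illustrated with a short Python sketch, not part of the original notes. It computes the sum of the squared vertical distances from the age and blood pressure points to three candidate lines. The first two lines are hypothetical, chosen only for comparison; the third uses the coefficients that the least-squares formulas give for this data, rounded to three decimal places. All names are illustrative assumptions.

```python
# Age (x) and blood pressure (y) from the example in Section 10.2.
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]


def sum_squared_vertical_distances(a, b, xs, ys):
    """Sum of squared vertical distances from each point to the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Candidate lines written as (intercept a, slope b).
# "line 1" and "line 2" are hypothetical; "line 3" is the rounded
# least-squares line for this data.
candidates = {
    "line 1": (75.0, 1.0),
    "line 2": (90.0, 0.8),
    "line 3": (81.048, 0.964),
}

totals = {
    name: sum_squared_vertical_distances(a, b, ages, pressures)
    for name, (a, b) in candidates.items()
}
best = min(totals, key=totals.get)

for name, total in totals.items():
    print(name, round(total, 2))
print("smallest sum:", best)
```

Of the three candidates, the least-squares line gives the smallest sum, which is what "best fit" means here.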
[Figure: Scatter Plot with Three Lines]
[Figure: A Linear Relation]

Equation of a Line
• In algebra, the equation of a line is usually given as y = mx + b, where m is the slope of the line and b is the y intercept.
• In statistics, the equation of the regression line is written as y′ = a + bx, where a is the y′ intercept and b is the slope of the line (y′ denotes the predicted value of y).

Formulas for the Regression Line
• Formulas for the regression line y′ = a + bx:

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

  where a is the y′ intercept and b is the slope of the line.

Rounding Rule
• When calculating the values of a and b, round to three decimal places.

Example: Find the equation of the regression line for the data obtained in the study of age and blood pressure.

Example: Using the equation of the regression line, predict the blood pressure for a person who is 50 years old.

Procedure
Finding the correlation coefficient and the regression line equation
• Step 1 Make a table with columns for subject, x, y, xy, x², and y².
• Step 2 Find the values of xy, x², and y². Place them in the appropriate columns, and sum each column.
• Step 3 Substitute in the formula to find the value of r.
• Step 4 When r is significant, substitute in the formulas to find the values of a and b for the regression line equation y′ = a + bx.

Summary
• The strength and direction of the linear relationship between variables is measured by the value of the correlation coefficient r.
• r can assume values between and including −1 and +1.
• A value of +1 or −1 indicates a perfect linear relationship.
• The closer the value of the correlation coefficient is to +1 or −1, the stronger the linear relationship is between the variables.
• Relationships can be linear or curvilinear. To determine the shape, one draws a scatter plot of the variables.
• If the relationship is linear, the data can be approximated by a straight line, called the regression line or the line of best fit.

Conclusion
• Many relationships among variables exist in the real world.
One way to determine whether a relationship exists is to use the statistical techniques known as correlation and regression.
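As a closing illustration, Step 4 of the procedure can be sketched in Python for the age and blood pressure data. This sketch is not part of the original notes. It takes r = 0.897 from the example, checks significance with the t test, and only then computes a and b and predicts the blood pressure of a 50-year-old. The critical value 2.776 is an assumption taken from a standard two-tailed t table for α = 0.05 with 4 degrees of freedom, and all names are illustrative.

```python
import math

# Age (x) and blood pressure (y) from the example in Section 10.2.
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]

r = 0.897           # sample correlation coefficient given in the example
critical_t = 2.776  # assumed: two-tailed t table value, alpha = 0.05, d.f. = 4


def t_statistic(r, n):
    """t test value for a correlation coefficient r from n data pairs."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def regression_coefficients(xs, ys):
    """Return (a, b) for the regression line, rounded to three decimals."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return round(a, 3), round(b, 3)


t = t_statistic(r, len(ages))
if abs(t) > critical_t:
    # r is significant, so the regression line is worth computing.
    a, b = regression_coefficients(ages, pressures)
    predicted = a + b * 50  # predicted blood pressure at age 50
    print(f"t = {t:.2f}, regression line: y' = {a} + {b}x")
    print(f"predicted pressure at age 50: {predicted:.3f}")
else:
    print(f"t = {t:.2f}: r is not significant, so no regression line is fitted")
```

With these inputs the test value is about 4.06, which exceeds the assumed critical value, so the sketch goes on to fit the line and make the prediction.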