AP Statistics Unit 3 - Examining Relationships
Name:____________________________________ Date:_____________________________________

Most statistical studies involve more than one variable. Often on the AP Statistics exam, you will be asked to compare two data sets using side-by-side boxplots, histograms, and so on. However, there are times when we want to examine relationships among several variables measured on the same group of individuals. For example, is the temperature outside related to the amount of ice cream sold? Does the amount of TV watched affect IQ? When you examine the relationship between two variables, start with a scatterplot.

Scatterplots and Correlation

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point in the plot fixed by the values of both variables for that individual.

A response variable measures the outcome of a study. An explanatory variable helps explain or influences changes in a response variable.

You will often find explanatory variables called independent variables, and response variables called dependent variables. The idea behind this language is that the response variable depends on the explanatory variable. Because the words independent and dependent have other, unrelated meanings in statistics, we won’t use them here.
[Figure: types of LINEAR plots. Panels: Perfect +, Strong +, Moderate +, Perfect -, Strong -, Moderate -]

[Figure: types of NON-LINEAR plots. Panels: Quadratic, Cubic, Exponential, Natural Logarithmic, Quartic, None]

There are two ways of determining whether two variables are related:
a) By looking at a scatter plot (graphical approach)
b) By calculating a “correlation coefficient” (mathematical approach)

Graphical Approach: Interpreting a Scatterplot

• In any graph of data, look for the overall pattern and for striking deviations from that pattern.
• You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
• An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.

Things to look for in a scatterplot:
1) Overall pattern (linear, exponential, etc.) and deviations from the pattern (outliers)
2) Direction: positive or negative slope
3) Strength: how closely the points follow a simple form (such as a line)

Ex 1: Suppose we hypothesize that the number of doctor visits a person makes can be explained by the number of cigarettes they smoke. We want to see if there is a relationship between the number of cigarettes one smokes per week (the explanatory or independent variable) and the number of times per year one visits a doctor (the response or dependent variable). We ask 10 random people and get the following information:

# of cigarettes per week           0   3   21  15  30  5   40  60  0   0
Number of doctor visits per year   1   2   4   3   5   1   5   6   2   0

If we plot these data points on an x-y coordinate plane, we get a scatter plot that looks like this:

[Figure: Smoking Scatter Plot. Vertical axis: doctor visits / year (0 to 7); horizontal axis: number of cigarettes / week (0 to 80)]

Note that the graph shows that people who smoke more cigarettes per week also tend to go to the doctor more often. This result demonstrates that there is a positive relationship between the two variables.
The following scatter plots show other types of relationships that might occur between two variables:

[Figures: scatter plots of other relationship types]

Mathematical Approach:

To strengthen the analysis when comparing two variables, we can attach a number, called the correlation coefficient (r), to describe the linear relationship between two variables. This number helps remove subjectivity in reading a linear scatter plot and helps us avoid being fooled by axis manipulation (see example below):

The correlation measures the strength and direction of the linear relationship between two quantitative variables.

The linear correlation coefficient (r), AKA Pearson’s product moment correlation coefficient, measures the strength of the linear relationship between the given variables:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$

where n is the number of ordered pairs.

Or...

$$r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

$r^2$ (in percent form) is the percent of variation in the y variable that can be explained by variation in the x variable.

r                    Type of correlation
1                    Perfect +
.75 ≤ r < 1          Strong +
.50 ≤ r < .75        Moderate +
-.50 < r < .50       None
-.75 < r ≤ -.50      Moderate -
-1 < r ≤ -.75        Strong -
-1                   Perfect -

Interpreting Correlation Coefficients:
1) The value of r is always between -1 and 1.
2) A correlation of -1 implies two variables are perfectly negatively correlated: as one goes up the other goes down, and as one goes down the other goes up.
3) A correlation of 1 implies that there is perfect positive correlation between the two variables. As one goes up, so does the other.
4) A correlation of 0 implies that there is no linear correlation between the two variables. In other words, there is no linear relationship between them (a curved relationship may still exist). As one goes up, r tells us nothing about whether the other will go up or down.
5) Positive correlations between 0 and 1 have varying strengths, with the strongest positive correlations being closer to 1.
6) Negative correlations between 0 and -1 are also of varying strength, with the strongest negative correlations being closer to -1.
7) r does not have units. Changing the units on your data will not affect the correlation.
8) Correlation describes only linear relationships between two variables.
9) r is very strongly affected by outliers. Use r with caution when outliers appear in your scatter plot. Don’t rely on r alone to determine the linear strength between two variables; graph a scatter plot first.
10) Correlation makes no distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y when calculating the correlation.

Ex 2: Construct a scatterplot and compute the linear correlation coefficient between x and y.

x   1   2   3   5    9    11
y   3   5   7   10   16   20

Ex 3: Let’s calculate the correlation between the number of cigarettes smoked per week and the number of doctor visits per year:

# of cigarettes per week           0   3   21  15  30  5   40  60  0   0
Number of doctor visits per year   1   2   4   3   5   1   5   6   2   0

$\bar{x}$ = ________   $s_x$ = ________   $\bar{y}$ = ________   $s_y$ = ________

x     y     $(x - \bar{x})/s_x$     $(y - \bar{y})/s_y$     $\frac{x - \bar{x}}{s_x} \cdot \frac{y - \bar{y}}{s_y}$
0     1
3     2
21    4
15    3
30    5
5     1
40    5
60    6
0     2
0     0

$$r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

So sum up the last column and multiply it by 1/9 (that is, 1/(n - 1) with n = 10).

r = _____________

Now check your answer by doing it on the calculator.

Linear Regression
AKA: Least Squares Regression Line (LSRL), Line of Best Fit, Linear Regression Equation

Least squares regression, a.k.a. linear regression, allows you to fit a line to a scatter diagram in order to predict the value of one variable from the value of another. The fitted line is called the line of best fit, linear regression line, or least squares regression line (LSRL) and has the form $\hat{y} = a + bx$, where:
a: y-intercept
b: slope of the line

The way the line is fitted to the data is through a process called the method of least squares.
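As a sketch, the least-squares criterion can be written in symbols, using the same $(x_i, y_i)$ notation as the correlation formula (the explicit index range is an assumption added here for clarity). The method chooses the intercept a and slope b that make this sum as small as possible:

$$\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2$$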
The main idea behind this method is that the sum of the squares of the vertical distances between the data points and the line is minimized. The least squares regression line is a mathematical model for the data that helps us predict values of the response (dependent) variable from the explanatory (independent) variable. Therefore, with regression, unlike with correlation, we must specify which is the response and which is the explanatory variable.

There are two sets of formulas that can be used to find the line of best fit.

Formulas for finding the slope and y-intercept of a linear regression line:

Slope: $b = r\,\dfrac{s_y}{s_x}$        Intercept: $a = \bar{y} - b\bar{x}$

The slope of the regression line is important because it gives us the rate of change of $\hat{y}$ with respect to x. In other words, it gives us the amount of change in $\hat{y}$ when x increases by 1.

So for the previous example the process for finding the LSRL is:
1) We found the correlation coefficient is r = ________
2) We can find $s_y$ and $s_x$ using the calculator
3) Therefore: b = ________
4) We can find $\bar{y}$ and $\bar{x}$ from the calculator as well.
5) Therefore: a = ________

The equation of the regression line is: ________________

The alternate set of formulas that can be used for determining the LSRL is given below. Note that these formulas follow the calculator's LinReg(ax + b) naming, in which a is the slope and b is the y-intercept, the reverse of $\hat{y} = a + bx$ above.

$$\text{slope} = a = b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

$$y\text{-intercept} = b = b_0 = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$

Ex 4: Using the data from Ex 3, write the equation of the line of best fit using the formulas above.

Using the Graphing Calculator:
Note: You should only need to do steps 2-8, 10 once.
1. Enter x values in L1 and y values in L2
2. Press 2nd y=
3. Make sure all plots are off; if not, press 4 (PlotsOff), Enter
4. Press 2nd y=
5. Select 1
6. Turn this plot on by highlighting On and hitting Enter
7. Press the Down Arrow once and press Enter
8. Verify that L1 and L2 are listed.
9. Press Zoom 9 (on your graph you should see a scatterplot)
10. Press 2nd 0, select DiagnosticsOn, and press Enter twice
11. Press Stat, Right Arrow, 4 (LinReg(ax + b))
12.
L1, L2, Y1, Enter (correlation and regression data are now presented)
13. Zoom 9 (the line should appear on the graph, verifying that the equation is correct)

Ex 5: Verify your answers to Ex 2-4 using the graphing calculator.

Ex 6: The costs of mountain bikes are given in the following table, along with overall rating scores given to each bike by Consumer Reports magazine.
Graph the data using a scatterplot.
Identify the explanatory and response variables.
Interpret the scatterplot that you just constructed on your calculator.
Find the strength of the linear correlation.
Find the equation of the least squares line.

How good is our prediction?

The strength of a prediction that uses the LSRL depends on how close the data points are to the regression line. The mathematical approach to describing this strength is via the coefficient of determination. The coefficient of determination gives us the proportion of variation in the values of y that is explained by the least-squares regression of y on x.

The coefficient of determination turns out to be $r^2$ (the correlation coefficient squared).

Whenever you use the regression line for prediction, also include $r^2$ as a measure of how successful the regression is in explaining the response.

In our example $r^2 \approx 0.857$. This means that over 85% of the variation in doctor visits per year can be explained by the linear relationship it has with cigarettes smoked per week.

Residuals:

Because the LSRL is fitted by minimizing the squared vertical distances between the data values and the line, we have a special name for these vertical distances. They are called residuals. A residual is simply the difference between the observed y and the predicted y.

Residual = y - ŷ

In our example, if we want to find the residual for the person in our sample who smoked 3 cigarettes per week and had 2 doctor visits, we would need to use the regression equation to find ŷ.
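The calculation described above can be sketched in Python as a check on the calculator work. This is only an illustration: the names `lsrl`, `cigarettes`, and `visits` are made up here, and the code uses the summation formulas with $\hat{y} = a + bx$ (a = intercept, b = slope).

```python
from math import sqrt

# Smoking data from Ex 1 / Ex 3 (variable names are illustrative only).
cigarettes = [0, 3, 21, 15, 30, 5, 40, 60, 0, 0]   # x: cigarettes per week
visits = [1, 2, 4, 3, 5, 1, 5, 6, 2, 0]            # y: doctor visits per year


def lsrl(x, y):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)

    sxy = n * sum_xy - sum_x * sum_y      # numerator shared by r and the slope
    sxx = n * sum_x2 - sum_x ** 2
    syy = n * sum_y2 - sum_y ** 2

    r = sxy / sqrt(sxx * syy)             # correlation coefficient
    b = sxy / sxx                         # slope
    a = sum_y / n - b * (sum_x / n)       # intercept: y-bar minus b times x-bar
    return a, b, r


a, b, r = lsrl(cigarettes, visits)
r_squared = r ** 2

# Residual for the person with x = 3 cigarettes per week and y = 2 visits.
y_hat = a + b * 3
residual = 2 - y_hat

print(f"y-hat = {a:.4f} + {b:.4f} x, r = {r:.4f}, r^2 = {r_squared:.4f}")
print(f"residual at x = 3: {residual:.4f}")
```

The fitted coefficients should agree, up to rounding, with the ŷ column in the residual table of this handout (intercept about 1.3069, slope about 0.0916).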
Residual = ________________

Residual Plots:

Residuals help us determine how well our data can be modeled by a straight line, by enabling us to construct a residual plot. A residual plot is a scatter diagram that plots the residuals on the y-axis and their corresponding x values on the x-axis.

So for our example, the residual plot would be plotted as follows:

x     y    y hat     Residual
0     1    1.3069    -0.3069
3     2    1.5817     0.4183
21    4    3.2305     0.7695
15    3    2.6809     0.3191
30    5    4.0549     0.9451
5     1    1.7649    -0.7649
40    5    4.9709     0.0291
60    6    6.8029    -0.8029
0     2    1.3069     0.6931
0     0    1.3069    -1.3069
Sum of residuals:    -0.0074

Notice that the sum of the residuals is zero (any error is due to rounding).

[Figure: residual plot. Vertical axis: residuals (-1.5 to 1.5); horizontal axis: x (0 to 80)]

How does a residual plot help us? The pattern of the plot is the indicator of whether or not the data can be modeled linearly. The ideal plot is the one below:

[Figure: ideal residual plot]

The plot shows a uniform scatter of the points above and below the fitted line with no unusual individual observations.

Here are a few things to watch out for when examining the pattern of a residual plot:
1) A curved pattern: shows that the overall pattern of the data is not linear
2) Increasing or decreasing spread about the line as x increases: predictions for y will be less accurate for larger values of x (when the spread increases)
3) Individual points with large residuals: indicate outliers from the overall pattern

Outlier: An observation that lies outside the overall pattern in the scatter plot (either in the x or y direction)

Influential point: A point is influential if removing it would markedly change the position of the regression line. Points that are outliers in the x direction are often influential.

[Figure: scatter plot with child 18 and child 19 marked; LSRL with child 18 (solid line) and without child 18 (dashed line)]

In the scatter plot above, child 19 is an outlier (its score lies far above the rest of the data) and child 18 is influential because if we were to fit the LSRL without child 18 (dashed line) it would look considerably different than with child 18 (solid line).
Child 18, because it is extreme in the x direction, has a disproportionate pull on the regression line and is therefore an influential point.

Important note: Influential points often have small residuals because they tend to pull the line towards themselves. Therefore, you might miss influential points if you only look at residuals.

Facts about least-squares regression

Fact 1. The distinction between explanatory and response variables is essential in regression. Least-squares regression looks at the distances of the data points from the line only in the y direction. If we reverse the roles of the two variables, we get a different least-squares regression line.

Fact 2. There is a close connection between correlation and the slope of the LSRL. The slope is:

$$b = r\,\frac{s_y}{s_x}$$

This equation says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.

Fact 3. The LSRL always passes through the point $(\bar{x}, \bar{y})$.

Fact 4. The correlation r describes the strength of a straight-line relationship. In the regression setting, this description takes a specific form: the square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Ex 7: An employer gives an aptitude test to his employees and compares it with their productivity at work. The results are given below:

X (Score on Aptitude Test)    6    9    3    8    7    5    8    20   4    11   10
Y (Productivity at Work)      30   49   18   42   39   25   70   40   15   45   52

a) Draw a scatter plot of the data. Label and scale the axes. Is there any pattern? Do the data appear to be positively correlated, negatively correlated, or neither? Which is your response and which is your explanatory variable?
b) Find the correlation coefficient, r. Does it support your answer to a?
c) Find the regression equation for this data and sketch it onto your scatter plot.
d) Discuss any outliers or influential points by visually inspecting the scatter plot.
e) Remove any potential influential points from your data and recalculate your regression equation. Did they, in fact, make a significant difference to your regression line?
f) Construct a residual plot and discuss its implications.
g) What proportion of the variation in productivity can be explained by the aptitude test?
   i. Including the influential point
   ii. Not including the influential point
h) A new job applicant scores a 9 on the aptitude test. The employer wants to predict his productivity level. Help the employer (use the regression equation without the influential point).

Correlation and Regression Wisdom

Extrapolation is the use of a regression line for prediction outside the range of values of the explanatory variable x that you used to obtain the line. Such predictions cannot be trusted.

Ex 1: The number of people living on American farms has steadily declined in the last 70 years.

Year         1945   1950   1955   1960   1965   1970   1975   1980   1985   1990
Population   32.1   30.5   24.4   23.0   19.1   15.6   12.4   9.7    8.9    7.2

a.) Make a scatterplot of the data and find the LSRL.
b.) According to the regression equation, how much did the farm population decline each year on average during this period?
c.) What percent of the observed variation in farm population is accounted for by the linear change over time?
d.) Use the regression equation to predict the number of people living on farms in 2010. Is this a reasonable result? Why?

Other Regressions

There are other regressions that can be determined using the graphing calculator:
Quadratic
Cubic
Quartic
Exponential
Natural Logarithmic

Ex 1: A study was conducted to determine if there is a relationship between the number of brownie bites eaten on a daily basis and an individual male’s weight. The data are given below.

# of brownies daily   12    13    22    31    41    53    65    73    83    99
Weight in lb          120   140   170   195   205   225   250   295   310   325

a.
Construct a scatterplot (complete, with appropriate labels).
b. Visually determine the best fit model.
c. Determine, using your calculator, the best regression model. Why is the one you chose the best? Justify your answer by showing all work.
d. Determine the equation of best fit.
e. What weight would you expect a person who eats 100 brownies per day to be?
f. How many brownie bites per day would you expect a person who is 150 lb to eat?

Transforming Relationships

Question: What do you do if the data show a relationship that is not linear?
Answer: Try a transformation.

Applying a function such as the logarithm or square root to a quantitative variable is called transforming or “re-expressing” the data.

Transforming data amounts to changing the scale of measurement that was used when the data were collected. We can choose to measure temperature in degrees Fahrenheit or in degrees Celsius, distance in miles or kilometers. These changes of units are linear transformations. Linear transformations cannot straighten a curved relationship between two variables. Linear transformations also have NO effect on the correlation coefficient (r).

To straighten a curved relationship we need to use nonlinear functions such as the logarithm, the square root, or the cubing function, to name a few. Nonlinear transformations can and do affect r, and that is our goal: to make the linear relationship between the transformed variables, as measured by r, as strong as possible.

Ex 1: No tortilla chip lover likes soggy chips, so it is important to find characteristics of the production process that produce chips with an appealing texture.
The following data on x = frying time (in seconds) and y = moisture content (%) appeared in the article “Thermal and Physical Properties of Tortilla Chips as a Function of Frying Time” (Journal of Food Processing and Preservation [1995]: 175-189).

Frying Time x          5      10    15    20    25    30    45    60
Moisture Content y     16.3   9.7   8.1   4.2   3.4   2.9   1.9   1.3

Sketch a scatterplot of the data, try a few transformations, and come up with the best transformed equation.

Frequently, an appropriate transformation is suggested by the data. One type of transformation that statisticians have found useful for straightening a plot is a power transformation. A power (exponent) is first selected, and each original value is raised to that power to obtain the corresponding transformed value. Table 5.6 displays a "ladder" of the most frequently used power transformations. The power 1 corresponds to no transformation at all. Using the power 0 would transform every value to 1, which is certainly not informative, so statisticians use the logarithmic transformation in its place in the ladder of transformations. Other powers intermediate to or more extreme than those listed can be used, of course, but they are less frequently needed than those on the ladder. Notice that all the transformations previously presented are included in this ladder.

Figure 5.31 is designed to suggest where on the ladder we should go to find an appropriate transformation. The four curved segments, labeled 1, 2, 3, and 4, represent shapes of curved scatterplots that are commonly encountered. Suppose that a scatterplot looks like the curve labeled 1. Then, to straighten the plot, we should use a power of x that is up the ladder from the no-transformation row ($x^2$ or $x^3$) and/or a power of y that is also up the ladder from the power 1. Thus, we might be led to squaring each x value, cubing each y, and plotting the transformed pairs.
If the curvature looks like curved segment 2, a power up the ladder from no transformation for x and/or a power down the ladder for y (e.g., $\sqrt{y}$ or log(y)) should be used.

The scatterplot for the tortilla chip data has the pattern of segment 3 in Figure 5.31. This suggests going down the ladder of transformations for x and/or for y. We found that transforming the y values in this data set by taking logarithms worked well, and this is consistent with the suggestion of going down the ladder of transformations for y.

Ex 2: In many parts of the world, a typical diet consists mainly of cereals and grains, and many individuals suffer from a substantial iron deficiency. The article "The Effects of Organic Acids, Phytates, and Polyphenols on the Absorption of Iron from Vegetables" (British Journal of Nutrition [1983]: 331-342) reported the data in Table 5.7 on x = proportion of iron absorbed and y = polyphenol content (in milligrams per gram) when a particular food is consumed.

Food               x       y
Wheat germ         0.007   6.4
Aubergine          0.007   3.0
Butter beans       0.012   2.9
Spinach            0.014   5.8
Brown lentils      0.024   5.0
Beetroot greens    0.024   4.3
Green lentils      0.032   3.4
Carrot             0.096   0.7
Potato             0.115   0.2
Beetroot           0.185   1.5
Pumpkin            0.206   0.1
Tomato             0.224   0.3
Broccoli           0.260   0.4
Cauliflower        0.263   0.7
Cabbage            0.320   0.1
Turnip             0.327   0.3
Sauerkraut         0.327   0.2

Sketch a scatterplot of the data, try a few transformations, and come up with the best transformed equation.

Use your transformed equation to answer the following questions.
1. What is the residual value for turnips?
2. Suppose the proportion of iron absorbed for yams was 0.181. What would be the predicted polyphenol content of yams?
3. Suppose the polyphenol content was 5.5 for zucchini. Using the transformed LSRL, find the estimated proportion of iron absorbed.
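Returning to the tortilla chip data from Ex 1, the effect of taking logarithms of y can be sketched in Python. This is a rough illustration, not the worksheet's prescribed method: the helper names (`correlation`, `least_squares`, `predict_moisture`) are invented here, and the natural log is an assumed choice of logarithm.

```python
from math import exp, log, sqrt

# Tortilla chip data from Ex 1 (variable names are illustrative only).
frying_time = [5, 10, 15, 20, 25, 30, 45, 60]             # x, seconds
moisture = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]      # y, percent


def _deviation_sums(x, y):
    """Return (Sxx, Syy, Sxy): sums of squared and cross deviations."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxx, syy, sxy


def correlation(x, y):
    """Linear correlation coefficient r."""
    sxx, syy, sxy = _deviation_sums(x, y)
    return sxy / sqrt(sxx * syy)


def least_squares(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    sxx, _, sxy = _deviation_sums(x, y)
    b = sxy / sxx
    a = sum(y) / len(y) - b * sum(x) / len(x)
    return a, b


# Go "down the ladder" for y: replace y with ln(y).
log_moisture = [log(v) for v in moisture]

r_raw = correlation(frying_time, moisture)
r_log = correlation(frying_time, log_moisture)

# Fit the line to the transformed data: ln(y-hat) = a + b*x.
a, b = least_squares(frying_time, log_moisture)


def predict_moisture(seconds):
    """Undo the log transformation to predict moisture content (%)."""
    return exp(a + b * seconds)


print(f"r (x, y) = {r_raw:.3f}   r (x, ln y) = {r_log:.3f}")
print(f"ln(y-hat) = {a:.4f} + {b:.4f} x")
```

Comparing `r_raw` with `r_log` shows whether the transformed pairs are closer to a straight line; predictions made from the transformed line have to be converted back with `exp`, which is what `predict_moisture` does.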
Cautions about Correlation and Regression

A Lurking Variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

A lurking variable can falsely suggest a strong relationship between x and y, or it can hide a relationship that is really there. Lurking variables can dramatically change the conclusions of a regression study. Because lurking variables are unrecognized and unmeasured, detecting their effects is a challenge. Many lurking variables change systematically over time.

Ex 1: As the number of cell phones has increased, so has the number of accidents on I-95. So there is a strong positive correlation between the number of cell phones and the number of car accidents. What is the lurking variable in this example?

Ex 2: A study of 1000 deaths in California showed that left-handed people died at an average age of 66 while right-handers died at an average age of 75. Should left-handers fear an early death?

“Beware the lurking variable” is good advice when thinking about an association between two variables. The example below illustrates Common Response. The observed association between the variables shoe size and math ability is explained by a lurking variable, age. This common response creates an association even though there may be no direct causal link between x and y.

Ex 3: If we were to collect data on the shoe size of children (ages 6 to 18) and their mathematical ability, we would probably find a strong positive correlation. This does not mean that having a larger foot causes you to be better at math. It simply means that a third variable, age, causes both shoe size and mathematical ability to increase.

[Diagram: Age, Shoe size, Math ability]

Two variables are Confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables.
When many variables interact with each other, confounding of several variables often prevents us from drawing conclusions about causation.

Ex 4: Researchers found that the more schooling a man’s wife had, the less likely he was to be at risk for heart disease. Confounding would occur if the men who married college graduates differ from those who don’t on some variable that is related to heart disease. We know that stress affects the heart. Maybe men who marry college grads have less stress because their wives make good livings and lessen the burden of bill paying.

Ex 5: Many studies have found that people who are active in their religion live longer than nonreligious people. But people who attend church or mosque or synagogue also take better care of themselves than non-attenders. They are less likely to smoke, more likely to exercise, and less likely to be overweight. The effects of good habits are confounded with the direct effects of attending religious services.

Note: many observed associations are at least partly explained by lurking variables. Both common response and confounding involve the influence of a lurking variable (or variables) z on the response variable y. The distinction between these two types of relationships is less important than the common element, the influence of lurking variables. The most important lesson is: even a strong association between two variables is not by itself good evidence that there is a cause and effect link between the variables.

Ex 6: Studies show that there is a correlation between high milk consumption in a society and a higher incidence of cancer in that society. Can we therefore conclude that high milk consumption causes cancer? Not really; keep in mind that societies that have high milk consumption are generally wealthier and their citizens also live longer, and the longer one lives, the more likely one is to contract cancer.
Thus we are not sure if the higher incidence of cancer is due to the third factor of longevity, the milk consumption, or some combination of both.

[Diagram: Drink milk, Longevity, Contract cancer]

The third factor of longevity is a lurking variable.

Here is the moral of the story:
Correlation/Association Doesn’t Imply Causation!