EXAMINING RELATIONSHIPS
Chapter 3

LESSON 3-1

VARIABLES
Response variable (dependent): measures the outcome of a study.
Explanatory variable (independent): attempts to explain the observed outcomes.

The most common treatment for breast cancer was once the removal of the breast. It is now usual to remove only the tumor and nearby lymph nodes, followed by radiation. The change in policy was due to a large medical experiment that compared the two treatments. Some breast cancer patients, chosen at random, were given each treatment. What are the explanatory and response variables? Are they categorical or quantitative?

Explanatory variable: treatment (old or new). Categorical.
Response variable: survival time. Quantitative.

SCATTERPLOT
A scatterplot shows the relationship between two quantitative variables measured on the same individuals.
Always plot the explanatory variable (if there is one) on the horizontal axis (x-axis) of a scatterplot, and the response variable on the vertical axis (y-axis).
If there is no explanatory-response distinction, either variable can go on the horizontal axis.

Manatees are large, gentle sea creatures that live along the Florida coast. Many manatees are killed or injured by powerboats. Here are data on powerboat registrations (in thousands) and the number of manatees killed by boats in Florida in the years 1977 to 1990.

A. We want to examine the relationship between the number of powerboats and the number of manatees killed by boats. Which is the explanatory variable?
Explanatory variable = number of powerboat registrations

MANATEES KILLED BY BOATS IN FLORIDA FROM 1977 TO 1990
Year   Powerboats (thousands)   Manatees killed
1977   447                      13
1978   460                      21
1979   481                      24
1980   498                      26
1981   513                      24
1982   512                      20
1983   526                      15
1984   559                      34
1985   585                      33
1986   615                      33
1987   645                      39
1988   675                      43
1989   711                      50

Make a scatterplot of these data. (Be sure to label the axes with variable names, not just x and y.) What does the scatterplot show about the relationship between these variables?
Calculator: 2nd Y= (STAT PLOT), then ZOOM 9 (ZoomStat).
The plot shows a moderately strong linear relationship. As registrations increase, the number of manatee deaths also tends to increase.

DESCRIBING A SCATTERPLOT
Direction (think slope): positive or negative.
Form: linear relationship, curved relationship, or clusters.
Scatter: how closely the points follow a clear form.
Outliers: an individual value that falls outside the overall pattern of the relationship.

A. Describe the direction of the relationship. Are the variables positively or negatively associated?
The variables are positively associated; that is, as the number of powerboat registrations increases, the number of manatees killed also increases.

B. Describe the form of the association. Is it linear?
The association is linear.

C. Describe the strength of the relationship. Can the number of manatees killed be predicted accurately from powerboat registrations? If powerboat registrations remained constant at 719,000, about how many manatees would be killed by boats each year?
The association is relatively strong. The number of manatees killed can be predicted accurately from the number of powerboat registrations. If the number of registrations remains constant at 719,000, we would expect between 45 and 50 manatees to be killed per year.

CALORIES AND HOT DOGS
Are hot dogs that are high in calories also high in salt? The following slide is a scatterplot of the calories and salt content (measured as milligrams of sodium) in 17 brands of meat hot dogs.

C. Are there any outliers? Is the relationship (ignoring outliers) roughly linear in form? Still ignoring outliers, how strong would you say the relationship between calories and sodium is?
The lower-left point is an outlier; ignoring it, we have a linear pattern.

A. Roughly what are the lowest and highest calorie counts among these brands? Roughly what is the sodium level in the brands with the fewest and with the most calories?
Lowest: about 107 calories with about 145 mg of sodium.
Highest: about 195 calories with about 510 mg of sodium.

B. Does the scatterplot show a clear positive or negative association? Say in words what this association means about calories and salt in hot dogs.
There is a positive association: high-calorie hot dogs tend to be high in salt, and low-calorie hot dogs tend to have low sodium.

LESSON 3-2 CORRELATION

Look at these scatterplots. What can you tell me about them?
They are the same graph; only the scale was changed, so be careful when describing plots by eye. So we will use a number, the correlation, to describe our plots accurately.

CORRELATION
Correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r.

$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

The lengths of two bones in five fossil specimens of the extinct beast Archaeopteryx:
Femur   = 56 59 64 74
Humerus = 63 70 72 84
A. Find the correlation r.

FACTS ABOUT CORRELATION
Correlation makes no distinction between explanatory and response variables.
Correlation requires both variables to be quantitative.
Correlation (r) itself has no unit of measurement; it is just a number.
Positive r indicates positive association. Negative r indicates negative association.
The correlation r is always between -1 and 1.

FACTS ABOUT CORRELATION
The closer r is to +1, the stronger the evidence of positive association between the two variables.
The closer r is to -1, the stronger the evidence of negative association between the two variables.
An r close to 0 does not rule out a strong relationship between x and y; there could still be a strong relationship, but one that is not linear.
Correlation is strongly affected by outlying observations.
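The correlation formula above can be checked by hand or with a few lines of code. The following is a minimal sketch in Python using only the standard library; the helper names (`mean`, `sample_sd`, `correlation`) are illustrative choices, not part of the lesson, and the data are the four femur/humerus pairs exactly as they appear in the text above.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Bone lengths as listed in the text (four pairs are shown there).
femur = [56, 59, 64, 74]
humerus = [63, 70, 72, 84]

r = correlation(femur, humerus)
print(round(r, 3))
```

Swapping the two lists, or rescaling one of them (for example, multiplying every femur length by 10), leaves r unchanged, which illustrates two of the facts listed above: r ignores the explanatory/response distinction and has no units.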
POSITIVE LINEAR CORRELATION
Perfect positive linear correlation: r = 1
Strong positive linear correlation: r ≈
Weak positive linear correlation: r ≈ .4

NEGATIVE LINEAR CORRELATION
Perfect negative linear correlation: r = -1
Strong negative linear correlation: r ≈ -.9
Weak negative linear correlation: r ≈ -.4

NO LINEAR CORRELATION
No linear correlation: r is close to 0

DESCRIBING THE STRENGTH OF A LINEAR RELATIONSHIP
Strong:   r from -1 to -.8
Moderate: r from -.8 to -.5
Weak:     r from -.5 to .5
Moderate: r from .5 to .8
Strong:   r from .8 to 1

Do people with larger brains have higher IQ scores? A study looked at 40 volunteer subjects, 20 men and 20 women. Brain size was measured by magnetic resonance imaging (MRI). Table 3-3 gives the data. The MRI count is the number of "pixels" the brain covered in the image. IQ was measured by the Wechsler test.

A) Make a scatterplot of IQ score versus MRI count, using distinct symbols for the men and women. In addition, find the correlation between IQ and MRI count for all 40 subjects, for the men alone, and for the women alone.
r (all) = .3576
r (men) = .4984
r (women) = .3257

B) Men are larger than women on average, so they have larger brains. How is this size effect visible in your graph?
The points for men are generally located on the right side of the plot, while the women's points are generally on the left.

C) Your result in (B) suggests separating men and women when looking at the relationship between brain size and IQ. Use your work in (A) to comment on the nature and strength of this relationship for women and for men.
The correlations for men and women suggest that there is a moderate positive association for men and a weak one for women. However, one significant feature of the data that can be observed in the scatterplot is that the sample group was highly stratified; that is, there were 10 men and 10 women with high IQs (at least 130), while the other 10 of each gender had IQs of no more than 103.
The men's higher correlation can be attributed partly to the two subjects with large brains and IQs of 103 (which are high relative to the low-IQ group). The men's correlation might not remain so high with a larger sample.

LESSON 3-3 LEAST-SQUARES REGRESSION

BEST FIT LINE
Is a straight line (equation) that describes how a response variable y changes as an explanatory variable x changes.
A best fit line (equation) is used to predict the value of y for a given value of x.
A best fit line, unlike correlation, requires that we have an explanatory variable and a response variable.

BEST FIT LINE
Is the line that comes closest to all the points.
$y$ is the actual value; $\hat{y}$ is the predicted value.
A residual is the difference between the observed value and the associated predicted value.
A negative residual shows a model value that is an overestimate.
A positive residual shows a model value that is an underestimate.

"BEST FIT" MEANS LEAST-SQUARES
The line of "best fit" is the line for which the sum of the squared residuals is the smallest.
The line of "best fit" is called the least-squares regression line (LSRL).

EQUATION OF LSRL
Equation: $\hat{y} = b_0 + b_1 x$
Slope: $b_1 = r \frac{s_y}{s_x}$
Intercept: $b_0 = \bar{y} - b_1 \bar{x}$
The line must go through the point $(\bar{x}, \bar{y})$.

Keeping water supplies clean requires regular measurement of levels of pollutants. The measurements are indirect: a typical analysis involves forming a dye by a chemical reaction with the dissolved pollutant, then passing light through the solution and measuring its "absorbance." To calibrate such measurements, the laboratory measures known standard solutions and uses regression on a series of data relating absorbance to different levels of nitrates. Nitrates are measured in milligrams per liter of water.

Nitrates     50    50    100    200    400    800    1200    1600    2000    2000
Absorbance   7.0   7.5   12.8   24.0   47.0   93.0   138.0   183.0   230.0   226.0

$r = .9999$, $\bar{x}_N = 840$

A. What is the equation of the least-squares line for predicting absorbance from concentration?
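As a cross-check on the question just posed, the slope and intercept formulas $b_1 = r \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$ can be applied directly to the raw nitrate data. This is a minimal sketch in standard-library Python; the function names are illustrative assumptions, and the results may differ in the last digit from a hand calculation that uses rounded summary statistics.

```python
from math import sqrt

# Calibration data from the table above.
nitrates = [50, 50, 100, 200, 400, 800, 1200, 1600, 2000, 2000]
absorbance = [7.0, 7.5, 12.8, 24.0, 47.0, 93.0, 138.0, 183.0, 230.0, 226.0]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Correlation r from standardized values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    b1 = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1

b0, b1 = least_squares_line(nitrates, absorbance)
print(f"y-hat = {b0:.3f} + {b1:.4f} x")
```

Because the intercept is defined as $\bar{y} - b_1 \bar{x}$, plugging $\bar{x}$ into the fitted line returns $\bar{y}$, so the line passes through $(\bar{x}, \bar{y})$ by construction.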
Find the slope and intercept:
$b_1 = r \frac{s_y}{s_x} = .9999 \left( \frac{90.953}{802.704} \right) = .1133$
$b_0 = \bar{y} - b_1 \bar{x} = 96.83 - (.1133)(840) = 1.658$
$\hat{y} = 1.658 + .1133x$

Summary statistics: $s_N = 802.704$, $\bar{y}_A = 96.83$, $s_A = 90.953$
How do I get these from the calculator? 2-Var Stats.

If the lab analyzed a specimen with 500 milligrams of nitrates per liter, what do you expect the absorbance to be? Based on your plot and the correlation, do you expect your predicted absorbance to be very accurate?
$\hat{y} = 1.658 + .1133x$
$\hat{y} = 1.658 + .1133(500) = 58.31$
This prediction should be very accurate, since the relationship is so strong.

ASSESSING THE ACCURACY OF THE LINEAR MODEL
The standard deviation and r² are numerical measures used in assessing how well the model fits.

r²: COEFFICIENT OF DETERMINATION
A measure of the proportion of variability in the y variable that can be "explained" by the linear relationship between x and y.
100 × r² is the percentage of variation in y that can be attributed to an approximate linear relationship between x and y.
It is a number between 0 and 1. The closer r² is to 1, the better the line describes how the change in the explanatory variable affects the value of the response variable, meaning a better fit.
r² is the square of the linear correlation coefficient for the least-squares regression model.
A natural measure of variation about the least-squares regression line is the sum of the squared residuals. It measures the amount of variation in y that cannot be explained by the linear relationship between x and y.

A study of class attendance and grades among first-year students at a state university showed that, in general, students who attended a higher percent of their classes earned higher grades. Class attendance explained 16% of the variation in grade index among students. What is the numerical value of the correlation between percent of classes attended and grade index?
$r^2 = .16$, so $r = \sqrt{.16} = .40$
Higher attendance goes with higher grades, so the correlation must be positive.

Some people think that the behavior of the stock market in January predicts its behavior for the rest of the year. Take the explanatory variable x to be the percent change in a stock market index in January and the response variable y to be the change in the index for the entire year. We expect a positive correlation between x and y because the change in January contributes to the full year's change. Calculation from data for the years 1960 to 1997 gives
$\bar{x} = 1.75\%$, $s_x = 5.36\%$, $\bar{y} = 9.07\%$, $s_y = 15.35\%$, $r = .596$

A. What percent of the observed variation in yearly changes in the index is explained by a straight-line relationship with the change during January?
$r = .596$, $r^2 = .355 = 35.5\%$

B. What is the equation of the least-squares line for predicting full-year change from January change?
$b_1 = r \frac{s_y}{s_x} = .596 \left( \frac{.1535}{.0536} \right) = 1.707$
$b_0 = \bar{y} - b_1 \bar{x} = .0907 - (1.707)(.0175) = 0.06083 = 6.083\%$
$\hat{y} = 6.083\% + 1.707x$

C. The mean change in January is $\bar{x} = 1.75\%$. Do you have to use the regression line to predict the change in the index in a year in which the index rises 1.75% in January?
No. The predicted change is $\bar{y} = 9.07\%$, since the regression line must pass through $(\bar{x}, \bar{y})$.

ASSUMPTIONS AND CONDITIONS
Check the scatterplot. The shape must be linear or we can't use regression at all.
Watch out for outliers. Outlying values have large residuals, and squaring makes their influence that much greater.
Outlying points can dramatically change a regression model. They can even change the sign of the slope, misleading us about the underlying relationship between the variables.
An r² of 100%: you may have accidentally regressed two variables that measure the same thing.
Don't extrapolate beyond the data. A linear model will often do a reasonable job of summarizing a relationship in the narrow range of observed x-values.
Beware of predicting y-values for x-values that lie outside the range of the original data.
If you must extrapolate into the future, at least don't believe that the prediction will come true!
Don't infer that x causes y just because there is a good linear model for their relationship.
Correlation and regression describe only linear relationships. You can do the calculations on any two quantitative variables, but the results are only useful if the scatterplot is linear.
Correlation is not resistant. Look for unusual observations.

RESIDUALS
A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
residual = observed - predicted = $y - \hat{y}$

PLOTTING THE RESIDUALS
A residual plot is a scatterplot of the (x, residual) pairs.
A residual plot is a good place to start when assessing the appropriateness of the regression line.
Residuals always sum to 0, so the mean of the residuals is also 0.

CALCULATE RESIDUAL
We have a study of fat gained due to change in exercise:
Fat gain = 3.505 - .00344 × Change
One subject's Change rose by 135 calories. That subject gained 2.7 kg of fat.
What is the predicted fat gain for a change of 135 calories?
3.505 - .00344(135) = 3.04 kg
So the residual is actual - predicted: 2.7 - 3.04 = -.34 kg

RESIDUALS
If your residual is positive, what has happened? The actual value exceeds the prediction, so the model underestimates.
What about when it is negative? Then the actual value falls short of the predicted value, so the model overestimates.

RESIDUAL PLOT
Determine whether a linear model is appropriate to describe the relationship between the explanatory and response variables.
Residuals are what is "left over" after the model describes the relationship; they often reveal subtleties that were not clear from a plot of the original data.
Determine whether the variance of the residuals is constant.
Check for outliers.

RESIDUAL PLOTS: UNIFORM
The uniform scatter of points indicates that the regression line is a good model.

RESIDUAL PLOTS: CURVED
The residuals have a curved pattern, so a straight line is an inappropriate model.

RESIDUAL PLOTS: INCREASING/DECREASING
The response variable y has more spread for larger values of the explanatory variable x, so prediction will be less accurate when x is large.

INFLUENTIAL OBSERVATIONS
An outlier is an observation that lies outside the overall pattern of the other observations.
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.

RESIDUAL PLOT: INDIVIDUAL POINTS
Individual points with large residuals (Child 19) are outliers.
Individual points that are extreme in the x direction (Child 18) are influential observations.

Lean body mass as a predictor of metabolic rate. Exercise 3.12, page 132, provides data from a study of dieting for 12 women and 7 men subjects. We explore the data further. Type mass into L1 and rate into L2.

A. Make a scatterplot.

B. Perform least-squares regression on your calculator and record the equation and the correlation. Lean body mass explains what percent of the variation in metabolic rate for women?
Lean body mass explains about 76.82% of the variation in metabolic rate.

C. Does the least-squares line provide an adequate model for the data?
We will make a residual plot on the calculator and look at the pattern.
Graphing residuals: turn on Stat Plot 1; Y1 is found under VARS > Y-VARS > Function; then press ZOOM 9.
From the residual plot, the line does appear to provide an adequate model. The residuals are scattered about the horizontal axis and no patterns are evident.
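The residual ideas in this lesson (residual = observed - predicted, and residuals from a least-squares line summing to 0) can also be checked numerically. The sketch below is in standard-library Python and makes two assumptions worth flagging: it reuses the nitrate calibration data from Lesson 3-3, because the metabolic-rate data are not reproduced in these notes, and it computes the slope as Sxy / Sxx, which is algebraically the same as $r \frac{s_y}{s_x}$. The function name `fit_line` is illustrative.

```python
# Nitrate calibration data from Lesson 3-3.
nitrates = [50, 50, 100, 200, 400, 800, 1200, 1600, 2000, 2000]
absorbance = [7.0, 7.5, 12.8, 24.0, 47.0, 93.0, 138.0, 183.0, 230.0, 226.0]

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 (slope = Sxy / Sxx)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_line(nitrates, absorbance)

# residual = observed - predicted, one per data point
residuals = [y - (b0 + b1 * x) for x, y in zip(nitrates, absorbance)]

# The (x, residual) pairs are what a residual plot displays.
for x, e in zip(nitrates, residuals):
    print(f"{x:5d}  {e:+.3f}")

# Up to floating-point rounding, the residuals sum to 0.
print("sum of residuals:", round(sum(residuals), 6))

# Fat-gain example from the notes: Fat gain = 3.505 - .00344 * Change
predicted_fat = 3.505 - 0.00344 * 135
fat_residual = 2.7 - predicted_fat   # negative: the model overestimates
print(round(predicted_fat, 2), round(fat_residual, 2))
```

Scanning the printed (x, residual) pairs for a curve or a fan shape is the numeric counterpart of reading the residual plot on the calculator.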