Chapter 5
Summarizing Bivariate Data

Suppose we found the age and weight for each person in a sample of 10 adults. Is there any relationship between the age and weight of these adults? Create a scatterplot of the data below.

Age   24   30   41   28   50   46   49   35   20   39
Wt   256  124  320  185  158  129  103  196  110  130

Do you think there is a relationship? If so, what kind? If not, why not?
There does not appear to be a relationship between age and weight in adults.

Suppose we found the height and weight for each person in a sample of 10 adults. Is there any relationship between the height and weight of these adults? Create a scatterplot of the data below.

Ht    74   65   77   72   68   60   62   73   61   64
Wt   256  124  320  185  158  129  103  196  110  130

Do you think there is a relationship? If so, what kind? If not, why not? Is it positive or negative? Weak or strong?

Correlation
• The relationship between bivariate numerical variables
  – May be positive or negative
  – May be weak or strong
What feature(s) of the graph would indicate a weak or strong relationship?
What does it mean if the relationship is positive? Negative?

Identify the strength and direction of the following data sets.
[Figure: four scatterplots labeled Set A, Set B, Set C, and Set D]
Set A shows a strong, positive linear relationship.
Set B shows little or no relationship.
Set C shows a weaker (moderate), negative linear relationship.
Set D shows a strong, positive curved relationship.

Identify as having a positive relationship, a negative relationship, or no relationship.
1. Heights of mothers and heights of their adult daughters (positive)
2. Age of a car in years and its current value (negative)
3. Weight of a person and calories consumed (positive)
4. Height of a person and the person's birth month (no relationship)
5. Number of hours spent in safety training and the number of accidents that occur (negative)

Correlation Coefficient (r)
• A quantitative assessment of the strength and direction of the linear relationship in bivariate, quantitative data
• Pearson's sample correlation is used most
• Population correlation coefficient – $\rho$ (rho)
• Sample correlation coefficient – r
• Equation:
$r = \dfrac{1}{n-1}\sum\left(\dfrac{x_i-\bar{x}}{s_x}\right)\left(\dfrac{y_i-\bar{y}}{s_y}\right)$
What are the values in parentheses called? These are the z-scores for x and y.

Example 5.1
For the six primarily undergraduate universities in California with enrollments between 10,000 and 20,000, six-year graduation rates (y) and student-related expenditures per full-time student (x) for 2003 were reported as follows:

Expenditures       8011   7323   8735   7548   7071   8248
Graduation rates   64.6   53.0   46.3   42.5   38.5   33.9

Create a scatterplot and calculate r.

Example 5.1 Continued
[Figure: scatterplot of graduation rates against expenditures]
r = 0.05
In order to interpret what this number tells us, let's investigate the properties of the correlation coefficient.

Properties of r (correlation coefficient)
1) Legitimate values are $-1 \le r \le 1$.
[Figure: number line from -1 to 1 marked at -1, -.8, -.5, 0, .5, .8, 1. Values beyond ±.8 indicate strong correlation, values between ±.5 and ±.8 moderate correlation, values between 0 and ±.5 weak correlation, and 0 no correlation.]

Suppose that the graduation rates were changed from percents to decimals (divide by 100). Transform the graduation rates and calculate r. Then do the following transformations and calculate r:
  1) x' = 5(x + 14)
  2) y' = (y + 30) ÷ 4
r = 0.05. It is the same! Why?
2) The value of r is not changed by any linear transformation. (A negative multiplier reverses the sign of r but not its size.)

Suppose we wanted to estimate the expenditures per student for given graduation rates. Switch x and y, then calculate r.
r = 0.05. It is the same!
3) The value of r does not depend on which of the two variables is labeled x.

Suppose the 33.9 was REALLY 63.9. What do you think would happen to the value of the correlation coefficient?

Expenditures       8011   7323   8735   7548   7071   8248
Graduation rates   64.6   53.0   46.3   42.5   38.5   63.9

Plot a revised scatterplot and find r.
[Figure: original and revised scatterplots of graduation rates against expenditures]
r = 0.42. Extreme values affect the correlation coefficient.
4) The value of r is affected by extreme values.

Find the correlation for these points:

x   -3   -1    1    3    5    7    9
y   40   20    8    4    8   20   40

Compute the correlation coefficient. r = 0
Sketch the scatterplot. Does r = 0 mean that there is NO relationship between these points? No: r = 0, but the data set has a definite (curved) relationship!
5) The value of r is a measure of the extent to which x and y are linearly related.

Recap the Properties of r:
1. Legitimate values of r are $-1 \le r \le 1$.
2. The value of r is not changed by any linear transformation.
3. The value of r does not depend on which of the two variables is labeled x.
4. The value of r is affected by extreme values.
5. The value of r is a measure of the extent to which x and y are linearly related.

Example 5.1 Continued
Interpret r = 0.05.
In order to interpret r, recall the definition of the correlation coefficient: a quantitative assessment of the strength and direction of the linear relationship between bivariate, quantitative data.
There is a weak, positive, linear relationship between expenditures and graduation rates.

Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable?
Consider the following examples:
• The relationship between the number of cavities in a child's teeth and the size of his or her vocabulary is strong and positive. So does this mean I should feed children more candy to increase their vocabulary? These variables are both strongly related to the age of the child.
• Consumption of hot chocolate is negatively correlated with crime rate. Should we all drink more hot chocolate to lower the crime rate? Both are responses to cold weather.
Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.
Correlation does not imply causation.

What is the objective of regression analysis?
The objective of regression analysis is to use information about one variable, x, to draw some sort of a conclusion about a second variable, y.
Suppose that we have two variables:
x = the amount spent on advertising
y = the amount of sales for the product during a given period
What question might I want to answer using this data?
• x-variable: is the independent or explanatory variable
• y-variable: is the dependent or response variable
• We will use values of x to predict values of y.

Scatterplots frequently exhibit a linear pattern. When this is the case, it makes sense to summarize the relationship between the variables by finding a line that is as close as possible to the points in the plot. This is done by calculating the line of best fit, or Least Squares Regression Line (LSRL). The LSRL is the line that minimizes the sum of the squares of the deviations from the line.

The LSRL is $\hat{y} = a + bx$
• $\hat{y}$ (y-hat) means the predicted y. Be sure to put the hat on the y.
• b is the slope
  – it is the approximate amount by which y increases when x increases by 1 unit
• a is the y-intercept
  – it is the approximate height of the line when x = 0
  – in some situations, the y-intercept has no meaning

The slope of the LSRL is $b = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}$
The intercept of the LSRL is $a = \bar{y} - b\bar{x}$
Let's explore what this means . . .
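The properties of r recapped earlier, and the slope and intercept formulas just stated, can be checked numerically. The following is a minimal sketch using the Example 5.1 data; the function and variable names (`correlation`, `lsrl`, `r_outlier`, and so on) are illustrative choices, not part of the original slides.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    """Pearson's r: sum of the products of z-scores, divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    z_products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                  for xi, yi in zip(x, y)]
    return sum(z_products) / (len(x) - 1)

def lsrl(x, y):
    """Return (a, b) for y-hat = a + b*x using the slope and intercept formulas."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return a, b

# Example 5.1 data
expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
grad_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r = correlation(expenditures, grad_rates)

# Property 2: linear transformations x' = 5(x + 14), y' = (y + 30) / 4
r_transformed = correlation([5 * (x + 14) for x in expenditures],
                            [(y + 30) / 4 for y in grad_rates])

# Property 3: switching the roles of x and y
r_swapped = correlation(grad_rates, expenditures)

# Property 4: replacing 33.9 with the extreme value 63.9
r_outlier = correlation(expenditures, grad_rates[:-1] + [63.9])

# Property 5: a clear curved pattern can still give r = 0
r_curved = correlation([-3, -1, 1, 3, 5, 7, 9], [40, 20, 8, 4, 8, 20, 40])

# Slope and intercept of the LSRL for the same data
a, b = lsrl(expenditures, grad_rates)

print(round(r, 2), round(r_transformed, 2), round(r_swapped, 2))
print(round(r_outlier, 2), round(r_curved, 2))
print(a, b)
```

Plain Python is used here so the arithmetic behind each formula stays visible; a statistics package would give the same values.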
Suppose we have a data set that consists of just the observations (0,0), (3,10) and (6,2).
Let's fit a line to the data by drawing a line through what appears to be the middle of the points: $\hat{y} = .5x + 4$
Now find the vertical distance from each point to the line.

Point     Predicted y           Deviation
(0,0)     .5(0) + 4 = 4         0 - 4 = -4
(3,10)    .5(3) + 4 = 5.5       10 - 5.5 = 4.5
(6,2)     .5(6) + 4 = 7         2 - 7 = -5

Find the sum of the squares of these deviations.
Sum of the squares = 61.25
What is the sum of the deviations from the line? Will it always be zero?

The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
Use a calculator to find the line of best fit: $\hat{y} = \frac{1}{3}x + 3$
Find the vertical deviations from the line: -3 for (0,0), 6 for (3,10), and -3 for (6,2).
Find the sum of the squares of the deviations from the line.
Sum of the squares = 54

Researchers are studying pomegranate's antioxidant properties to see if it might be helpful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded for several points in time.
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume (in mm³)

x    11    15    19    23    27
y   150   270   450   580   740

Sketch a scatterplot for this data set.

Pomegranate study continued
Calculate the LSRL and the correlation coefficient.
$\hat{y} = -269.75 + 37.25x$
$r = 0.998$
Interpret the slope and the correlation coefficient in context. Remember that an interpretation is stating the definition in context.
The average volume of the tumor increases by approximately 37.25 mm³ for each day increase in the number of days after injection.
There is a strong, positive, linear relationship between the average tumor volume and the number of days since injection.
Does the intercept have meaning in this context? Why or why not?

Pomegranate study continued
Predict the average volume of the tumor for 20 days after injection.
$\hat{y} = -269.75 + 37.25(20) = 475.25$ mm³
Predict the average volume of the tumor for 5 days after injection.
$\hat{y} = -269.75 + 37.25(5) = -83.5$ mm³
Can volume be negative? This is the danger of extrapolation. The least-squares line should not be used to make predictions for y using x-values outside the range in the data set.
Why? It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.

Pomegranate study continued
Suppose we want to know how many days after injection of cancer cells the average tumor size would be 500 mm³. Is this the appropriate regression line to answer this question?
No. The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction. The slope of the line for predicting x is $r\,\dfrac{s_x}{s_y}$, not $r\,\dfrac{s_y}{s_x}$, and the intercepts are almost always different.
Here is the appropriate regression line: $\hat{x} = 7.277 + .027y$

Pomegranate study continued
Find the mean of the x-values ($\bar{x}$) and the mean of the y-values ($\bar{y}$).
$\bar{x} = 19$ and $\bar{y} = 438$
Plot the point of averages $(\bar{x}, \bar{y})$ on the scatterplot.
Will the point of averages always be on the regression line?
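The three-point comparison and the pomegranate calculations above can be reproduced as a quick check. This is a minimal sketch, assuming the slope and intercept formulas given earlier in the chapter; the helper names (`lsrl`, `sum_squared_deviations`) are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def lsrl(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

def sum_squared_deviations(x, y, a, b):
    """Sum of squared vertical deviations of the points from y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Three-point example: the eyeballed line versus the LSRL
px, py = [0, 3, 6], [0, 10, 2]
ss_eyeballed = sum_squared_deviations(px, py, 4, 0.5)   # y-hat = .5x + 4
a3, b3 = lsrl(px, py)                                   # y-hat = (1/3)x + 3
ss_lsrl = sum_squared_deviations(px, py, a3, b3)

# Pomegranate study, plain-water group
days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]
a, b = lsrl(days, volume)

pred_20 = a + b * 20   # inside the observed range of x
pred_5 = a + b * 5     # extrapolation: outside the observed range of x

# Predicting x from y needs its own line: regress days on volume
c, d = lsrl(volume, days)

# The point of averages lies on the LSRL
on_line = abs((a + b * mean(days)) - mean(volume)) < 1e-9

print(ss_eyeballed, ss_lsrl)
print(a, b, pred_20, pred_5)
print(round(c, 3), round(d, 3), on_line)
```

The negative prediction at 5 days is the arithmetic working as intended; it is the model, not the code, that fails outside the observed range.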
Let's investigate how the LSRL and correlation coefficient change when different points are added to the data set.
Suppose we have the following data set.

x   4   5   6   7   8
y   2   5   4   6   9

Sketch a scatterplot. Calculate the LSRL and the correlation coefficient.
$\hat{y} = -3.8 + 1.5x$
$r = 0.916$

Suppose we add the point (5,8) to the data set. What happens to the regression line and the correlation coefficient?
$\hat{y} = -1.15 + 1.17x$
$r = 0.667$

Suppose we instead add the point (12,12) to the original data set. What happens to the regression line and the correlation coefficient?
$\hat{y} = -2.24 + 1.225x$
$r = 0.959$

Suppose we instead add the point (12,0) to the original data set. What happens to the regression line and the correlation coefficient?
$\hat{y} = 6.26 - 0.275x$
$r = -0.248$

The correlation coefficient and the LSRL are both measures that are affected by extreme values.

Pomegranate study revisited
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume

x    11    15    19    23    27
y   150   270   450   580   740

Minitab, a statistical software package, was used to fit the least-squares regression line. Part of the resulting output is shown below. We will discuss what these numbers mean in Chapter 13.
  The regression equation is
  Predicted volume = -269.75 + 37.25 days

  Predictor   Coef       SE Coef      T           P
  Constant    -269.75    23.421412    -11.51724   0.0014
  Days        37.25      1.181454     31.52895    0.000

In this output, the Constant coefficient (-269.75) is the intercept and the Days coefficient (37.25) is the slope.

Assessing the fit of the LSRL
Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y.
Important questions are:
1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
3. If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?
We will look at graphical and numerical methods to answer these questions.

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

x   6.94   5.23   5.21    7.10    8.16    5.50    9.19    9.05    9.36
y   0      6.13   11.29   14.35   12.03   22.72   20.11   26.16   30.65

Minitab was used to fit the least-squares regression line. From the partial output, identify the regression line.

  Predictor            Coef     SE Coef   T       P
  Constant             -7.69    13.33     -0.58   0.582
  Distance to debris   3.234    1.782     1.82    0.112

  S = 8.67071   R-Sq = 32.0%   R-Sq(adj) = 22.3%

$\hat{y} = -7.69 + 3.234x$
Plot the data, including the regression line.
[Figure: scatterplot of distance traveled against distance to debris, with the fitted line]

Residuals
The vertical deviation between the point and the LSRL is called the residual. Residuals are calculated by subtracting the predicted y from the observed y:
residual $= y - \hat{y}$
If the point is above the line, the residual will be positive. If the point is below the line, the residual will be negative.

Use the LSRL to calculate the predicted distance traveled, then subtract to find the residuals. What does this remind you of?

Distance from debris (x)   Distance traveled (y)   Predicted distance traveled (ŷ)   Residual (y - ŷ)
6.94                        0.00                    14.76                             -14.76
5.23                        6.13                     9.23                              -3.10
5.21                       11.29                     9.16                               2.13
7.10                       14.35                    15.28                              -0.93
8.16                       12.03                    18.70                              -6.67
5.50                       22.72                    10.10                              12.62
9.19                       20.11                    22.04                              -1.93
9.05                       26.16                    21.58                               4.58
9.36                       30.65                    22.59                               8.06

What does the sum of the residuals equal? Will the sum of the residuals always equal zero?

Residual plots
• A residual plot is a scatterplot of the (x, residual) pairs.
• Residuals can also be graphed against the predicted y-values.
• The purpose is to determine if a linear model is the best way to describe the relationship between the x & y variables.
• If no pattern exists between the points in the residual plot, then the linear model is appropriate.
[Figure: two residual plots, residuals against x]
The first residual plot shows no pattern, so it indicates that the linear model is appropriate.
The second residual plot shows a curved pattern, so it indicates that the linear model is not appropriate.

Use the values in the residual table above to create a residual plot for the deer mouse data set. Is a linear model appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food?
Plot the residuals against the distance from debris (x).
[Figure: residuals against distance from debris]
Since the residual plot displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food.
Now plot the residuals against the predicted distance traveled.
[Figure: residuals against predicted distance traveled]
What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values?
Residual plots can be plotted against either the x-values or the predicted y-values.

Let's examine the following data set:
The following data is for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)

x   10.5   6.5   28.5   10.5   6.5   7.5   6.5   5.5   7.5   11.5   9.5   5.5
y   54     40    62     51     55    56    62    42    40    59     51    50

Sketch a scatterplot with the fitted regression line.
[Figure: scatterplot of weight against age with the fitted line; one point is labeled "Influential observation"]
Do you notice anything unusual about this data set?
What would happen to the regression line if this point is removed?
This point is considered an influential point because it affects the placement of the least-squares regression line.
An observation is an outlier if it has a large residual.
[Figure: the same scatterplot with an observation far from the line highlighted]
Notice that this observation has a large residual.

Coefficient of determination
• Denoted by r²
• Gives the proportion of variation in y that can be attributed to an approximate linear relationship between x & y

Let's explore the meaning of r² by revisiting the deer mouse data set.
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
Suppose you didn't know any x-values. What distance would you expect deer mice to travel? $\bar{y} = 15.938$
What is the total amount of variation in the distance traveled (y-values)? Hint: Find the sum of the squared deviations.
$SSTo = \sum (y - \bar{y})^2$
SS stands for "sum of squares", so this is the total sum of squares. Why do we square the deviations?
[Figure: scatterplot of distance traveled against distance to debris, with deviations from the mean of y]
Total amount of variation in the distance traveled is 773.95 m².

Now suppose you DO know the x-values. Your best guess would be the predicted distance traveled (the point on the LSRL). By how much do the observed points vary from the LSRL? Hint: Find the sum of the residuals squared.
$SSResid = \sum (y - \hat{y})^2$
[Figure: scatterplot of distance traveled against distance to debris, with deviations from the LSRL]
The points vary from the LSRL by 526.27 m².
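The residuals and the two sums of squares for the deer mouse data can be recomputed directly from the definitions above. This is a minimal sketch; the variable names are illustrative, and the proportion on the last line is simply the share of SSTo that is not left over in SSResid.

```python
def mean(values):
    return sum(values) / len(values)

def lsrl(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

# Deer mouse data (meters)
distance_to_debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
distance_traveled = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

a, b = lsrl(distance_to_debris, distance_traveled)

# Residual = observed y minus predicted y
predicted = [a + b * x for x in distance_to_debris]
residuals = [y - y_hat for y, y_hat in zip(distance_traveled, predicted)]

# Total variation in y, ignoring x
y_bar = mean(distance_traveled)
ss_total = sum((y - y_bar) ** 2 for y in distance_traveled)

# Variation left over once the LSRL is used
ss_resid = sum(e ** 2 for e in residuals)

# Share of the total variation accounted for by the line
proportion_explained = (ss_total - ss_resid) / ss_total

for x, y, y_hat, e in zip(distance_to_debris, distance_traveled, predicted, residuals):
    print(f"{x:5.2f} {y:6.2f} {y_hat:6.2f} {e:7.2f}")
print(round(sum(residuals), 6), round(ss_total, 2), round(ss_resid, 2))
print(round(proportion_explained, 3))
```

Printed residuals may differ from the slide table in the last digit, because the table was built from the rounded coefficients -7.69 and 3.234.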
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
Total amount of variation in the distance traveled is 773.95 m².
The points vary from the LSRL by 526.27 m².
$r^2 = 1 - \dfrac{SSResid}{SSTo}$
$r^2 = 1 - \dfrac{526.27}{773.95} = 0.320$
Approximately what percent of the variation in distance traveled can be explained by the regression line? Approximately 32%.

Partial output from the regression analysis of the deer mouse data:

  Predictor            Coef     SE Coef   T       P
  Constant             -7.69    13.33     -0.58   0.582
  Distance to debris   3.234    1.782     1.82    0.112

  S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

Let's review the values from this output and their meanings. What does each number represent?
• The y-intercept (a): This value has no meaning in context since it doesn't make sense to have a negative distance.
• The slope (b): The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter to the nearest debris pile.
• The standard deviation (s): This is the typical amount by which an observation deviates from the least-squares regression line. It's found by:
$s_e = \sqrt{\dfrac{SSResid}{n-2}}$
• The coefficient of determination (r²): Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.

Let's examine this data set:
x = representative age
y = average marathon finish time

Age     15       25       35       45       55       65
Time    302.38   193.63   185.46   198.49   224.30   288.71

Create a scatterplot for this data set.
[Figure: scatterplot of average finish time against representative age]
Because of the curved pattern, a straight line would not accurately describe the relationship between average finish time and age.
Since this curve resembles a parabola, a quadratic function can be used to describe this relationship.
The least-squares quadratic regression is
$\hat{y} = a + b_1 x + b_2 x^2$
This curve minimizes the sum of the squares of the residuals (similar to least-squares linear regression).
Using Minitab:
$\hat{y} = 462 - 14.2x + 0.179x^2$

Here is the residual plot.
[Figure: scatterplot with the fitted quadratic curve, and a plot of the residuals against age]
Notice the residuals from the quadratic regression. Since there is no pattern in the residual plot, quadratic regression is an appropriate model for this data set.

The measure R² is useful for assessing the fit of the quadratic regression.
$R^2 = 1 - \dfrac{SSResid}{SSTo}$
R² = .921
92.1% of the variation in average marathon finish times can be explained by the approximate quadratic relationship between average finish time and age.

Depending on the data set, other regression models, such as cubic regression, may be used. Statistical software (like Minitab) is commonly used to calculate these regression models.
Another method for fitting regression models to non-linear data sets is to transform the data, making it linear. Then a least-squares regression line can be fit to the transformed data.
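The quadratic fit for the marathon data can be reproduced without statistical software by solving the least-squares normal equations. This is a minimal sketch under that assumption; it is not how Minitab computes the fit internally, and the helper names (`solve_linear_system`, `quadratic_fit`) are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def quadratic_fit(x, y):
    """Return [a, b1, b2] minimizing the sum of squared residuals
    for y-hat = a + b1*x + b2*x**2."""
    # Normal equations: sums of x^(i+j) on the left, sums of x^i * y on the right
    matrix = [[sum(xi ** (i + j) for xi in x) for j in range(3)] for i in range(3)]
    rhs = [sum((xi ** i) * yi for xi, yi in zip(x, y)) for i in range(3)]
    return solve_linear_system(matrix, rhs)

# Marathon data
age = [15, 25, 35, 45, 55, 65]
finish_time = [302.38, 193.63, 185.46, 198.49, 224.30, 288.71]

a, b1, b2 = quadratic_fit(age, finish_time)

# R^2 = 1 - SSResid / SSTo
fitted = [a + b1 * x + b2 * x ** 2 for x in age]
y_bar = sum(finish_time) / len(finish_time)
ss_resid = sum((y - f) ** 2 for y, f in zip(finish_time, fitted))
ss_total = sum((y - y_bar) ** 2 for y in finish_time)
r_squared = 1 - ss_resid / ss_total

print(round(a, 1), round(b1, 1), round(b2, 3), round(r_squared, 3))
```

For higher-degree fits or larger data sets, a library routine is the safer choice, since the normal equations become numerically delicate as the powers of x grow.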
Commonly Used Transformations

Transformation                                   Equation
No transformation                                $\hat{y} = a + bx$
Square root of x                                 $\hat{y} = a + b\sqrt{x}$
Log of x *                                       $\hat{y} = a + b\log_{10} x$
Reciprocal of x                                  $\hat{y} = a + b\left(\dfrac{1}{x}\right)$
Log of y * (exponential growth or decay)         $\log_{10}\hat{y} = a + bx$

*Natural log may also be used

Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume

x   11   15   19   23    27    31    35    39
y   40   75   90   210   230   330   450   600

Sketch a scatterplot for this data set.
[Figure: scatterplot of average tumor volume against number of days]
There appears to be a curve in the data points. Let's use a transformation to linearize the data.
Since the data appears to be exponential growth, let's try the "log of y" transformation.

x        11     15     19     23     27     31     35     39
Log(y)   1.60   1.88   1.95   2.32   2.36   2.52   2.65   2.78

Sketch a scatterplot of the log(y) and x.
[Figure: scatterplot of log of average tumor volume against number of days]
Notice that the relationship now appears linear. Let's fit an LSRL to the transformed data.
The LSRL is $\log\hat{y} = 1.226 + 0.041x$

What would the predicted average tumor size be 30 days after injection of cancer cells?
$\log\hat{y} = 1.226 + 0.041x$
$\log\hat{y} = 1.226 + 0.041(30)$
$\log\hat{y} = 2.456$
$\hat{y} = 10^{2.456} = 285.76$ mm³

Another useful transformation is the power transformation. The power transformation ladder and the scatterplot (both below) can be used to help determine what type of transformation is appropriate.
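Returning to the log-of-y example for the .2% PFE group: the transformed fit and the back-transformed prediction can be checked with a short script. This is a minimal sketch; the names are illustrative, and it fits the line to unrounded base-10 logs, so its coefficients agree with the slide's 1.226 and 0.041 only to about two or three decimal places.

```python
import math

def mean(values):
    return sum(values) / len(values)

def lsrl(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

# Pomegranate study, .2% PFE group
days = [11, 15, 19, 23, 27, 31, 35, 39]
volume = [40, 75, 90, 210, 230, 330, 450, 600]

# Transform y, then fit an ordinary LSRL to (x, log10 y)
log_volume = [math.log10(v) for v in volume]
a, b = lsrl(days, log_volume)

# Back-transform a prediction at 30 days: y-hat = 10 ** (a + b * x)
fitted_prediction = 10 ** (a + b * 30)

# The slide's arithmetic, using its rounded coefficients
slide_prediction = 10 ** (1.226 + 0.041 * 30)

print(round(a, 3), round(b, 3))
print(round(fitted_prediction, 2), round(slide_prediction, 2))
```

The two predictions differ by a few cubic millimeters because small rounding differences in the exponent are magnified when raising 10 to that power.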
Power Transformation Ladder

Power   Transformed Value                     Name
3       (Original value)³                     Cube
2       (Original value)²                     Square
1       Original value                        No transformation
½       $\sqrt{\text{Original value}}$        Square root
1/3     $\sqrt[3]{\text{Original value}}$     Cube root
0       Log(Original value)                   Logarithm
-1      1 / (Original value)                  Reciprocal

[Figure: scatterplot shapes with curves labeled 1 and 2]
Suppose that the scatterplot looks like the curve labeled 1. Then we would use a power that is up the ladder from the no transformation row for both the x and y variables.
Suppose that the scatterplot looks like the curve labeled 2. Then we would use a power that is up the ladder from the no transformation row for the x variable and a power down the ladder for the y variable.

Logistic Regression (Optional)
• Can be used if the dependent variable is categorical with just two possible values
• Used to describe how the probability of "success" changes as a numerical predictor variable, x, changes
• With p denoting the probability of success, the logistic regression equation is
$p = \dfrac{e^{a+bx}}{1+e^{a+bx}}$
where a and b are constants
The graph of this equation has an "S" shape. For any value of x, the value of p is always between 0 and 1.

In a study on wolf spiders, researchers were interested in what variables might be related to a female wolf spider's decision to kill and consume her partner during courtship or mating. Data was collected for 53 pairs of courting wolf spiders. (Data listed on page 287)
x = the difference in body width (female – male)
y = cannibalism; coded 0 for no cannibalism and 1 for cannibalism
Minitab was used to construct a scatterplot and to fit a logistic regression to the data.
Note that the plot was constructed so that if two points fell in the exact same location they would be offset a little bit so that all points would be visible (called jittering).
$p = \dfrac{e^{-3.08904+3.06928x}}{1+e^{-3.08904+3.06928x}}$
This equation can be used to predict the probability of the male spider being cannibalized based on the difference in size.
What is the probability of cannibalism if the male and female spiders are the same width (difference of 0)?
$p = \dfrac{e^{-3.08904+3.06928(0)}}{1+e^{-3.08904+3.06928(0)}} = 0.044$
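The fitted logistic equation for the wolf spider data can be evaluated at any size difference. This is a minimal sketch that uses the coefficients reported above; the function name `logistic_probability` is a hypothetical choice, and the extra x-values are illustrative inputs rather than observations from the study.

```python
import math

def logistic_probability(x, a, b):
    """p = e^(a + b*x) / (1 + e^(a + b*x))"""
    linear = a + b * x
    return math.exp(linear) / (1 + math.exp(linear))

# Coefficients from the fitted wolf spider model
A = -3.08904
B = 3.06928

# Same body width: difference (female - male) of 0
p_equal = logistic_probability(0.0, A, B)

# A few illustrative differences, to see the S-shaped rise in p
differences = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
probabilities = [logistic_probability(x, A, B) for x in differences]

print(round(p_equal, 3))
for x, p in zip(differences, probabilities):
    print(f"{x:5.1f} {p:.3f}")
```

Whatever value of x is supplied, the result stays strictly between 0 and 1, which is what makes this form suitable for modeling a probability.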