Daniel S. Yates
The Practice of Statistics, Third Edition
Chapter 4: More about Relationships between Two Variables
Copyright © 2008 by W. H. Freeman & Company

Case Study: A Matter of Life and Death

Age    Monthly Premium
40     $29
45     $46
50     $68
55     $106
60     $157
65     $257

1. How much would a 58-year-old expect to pay for such a policy?
2. A 68-year-old?

4.1 Transforming to Achieve Linearity: Today's Objectives

• Explain what is meant by transforming data.
• Discuss the advantages of transforming nonlinear data.
• Tell where y = log(x) fits into the hierarchy of power transformations.
• Explain the ladder of power transformations.
• Explain how linear growth differs from exponential growth.
• Identify real-life situations in which a transformation can be used to linearize data from an exponential growth model.
• Use a logarithmic transformation to linearize a data set that can be modeled by an exponential model.
• Identify situations in which a transformation is required to linearize a power model.
• Use a transformation to linearize a data set that can be modeled by a power model.

Activity 4

Scatterplot and LSRL of brain weight against body weight for 96 species of mammals.

What do the outliers tell us? Dolphins and humans are smart, hippos are dumb, and elephants are just big.

Scatterplot of brain weight against body weight for mammals with outliers removed.

Scatterplot and LSRL of the logarithm of brain weight against the logarithm of body weight for 96 species of mammals.

Transforming data

Applying a function such as the logarithm or square root to a quantitative variable is called transforming or reexpressing the data. Understanding how simple functions work will help us choose and use transformations to straighten nonlinear patterns.

Fishing tournament
Transforming data with powers

Since weighing live, flopping fish would be difficult, it is decided to relate the length of fish to their weight.
Since length is one-dimensional and weight (which depends on volume) is three-dimensional, it is concluded that the weight of a fish should be proportional to the cube of its length. Thus a model of the form

\[ \text{weight} = a \times \text{length}^3 \]

should work. The nearby marine research laboratory provides the following data.

Average length and weight at different ages for Atlantic Ocean rockfish

Age (yr)  Length (cm)  Weight (g)    Age (yr)  Length (cm)  Weight (g)
 1         5.2           2           11        28.2         318
 2         8.5           8           12        29.6         371
 3        11.5          21           13        30.8         455
 4        14.3          38           14        32.0         504
 5        16.8          69           15        33.0         518
 6        19.2         117           16        34.0         537
 7        21.3         148           17        34.9         651
 8        23.3         190           18        36.4         719
 9        25.0         264           19        37.2         726
10        26.7         293           20        37.7         810

Scatterplot of Atlantic Ocean rockfish weight versus length.

The scatterplot of weight versus \(\text{length}^3\) looks linear. We perform a least-squares regression on the transformed points \((\text{length}^3, \text{weight})\). The resulting equation is

\[ \widehat{\text{weight}} = 4.066 + 0.0147 \times \text{length}^3 \]

with \(r^2 = 0.995\). So 99.5% of the variation in the weight of Atlantic Ocean rockfish is accounted for by the linear regression of weight on \(\text{length}^3\).

Plot of residuals versus \(\text{length}^3\).

For smaller fish, the residuals fall below the line, which indicates that the model's predictions for small fish will be slightly high. For larger fish, our results should be quite accurate.

Atlantic Ocean rockfish data with the model \(\widehat{\text{weight}} = 4.066 + 0.0147 \times \text{length}^3\).

Prediction

Using our model, we can predict the weight of a fish that is 36 centimeters long:

\[ \widehat{\text{weight}} = 4.066 + 0.0147(36)^3 = 689.9 \text{ grams} \]

Assignment: P. 199: 4.1; P. 219: 4.13, 4.16

TRANSFORMING WITH POWERS

The hierarchy of power functions. The logarithm function corresponds to p = 0.

Facts about the family of power functions

• The graph of a linear function (power p = 1) is a straight line.
• Powers greater than 1 (like p = 2 and p = 4) give graphs that bend upward. The sharpness of the bend increases as p increases.
• Powers less than 1 but greater than 0 (like p = 0.5) give graphs that bend downward.
• Powers less than 0 (like p = -0.5 and p = -1) give graphs that decrease as x increases.
Greater negative values of p result in graphs that decrease more quickly.
• Look at the p = 0 graph. This is not the graph of \(y = x^0\), because the zero power is just the constant 1. It is the graph of the logarithm, log x.

A country's GDP and life expectancy
The hierarchy of power transformations

Life expectancy and gross domestic product (GDP) for 115 nations.

Transformation of life expectancy and gross domestic product for 115 nations using \(\sqrt{\text{GDP}}\).

Transformation of life expectancy and gross domestic product for 115 nations using log(GDP).

Transformation of life expectancy and gross domestic product for 115 nations using \(1/\sqrt{\text{GDP}}\).

Warning: Don't just push buttons on your calculator to try to straighten out the data. It is more useful to begin with a theory or mathematical model that we expect to describe a relationship. The transformation needed to make the relationship linear is then a consequence of the model. One of the most common models is exponential growth.

Growth of a bacteria population over a 24-hour period.

The growth of money
Understanding Exponential Growth

A dollar invested at an annual rate of 6% turns into $1.06 in a year. The original dollar remains and has earned $0.06 in interest. If the $1.06 remains invested for a second year, the new amount is therefore \(1.06 \times 1.06 = (1.06)^2\) dollars. After x years, the dollar has become \((1.06)^x\) dollars.

Wealthy Indians

If the Native Americans who sold Manhattan Island for $24 in 1626 had deposited the $24 in a savings account earning 6% annual interest, they would now have almost $80 billion.

Moore's law and computer chips
Exponential growth

Gordon Moore, one of the founders of Intel, predicted in 1965 that the number of transistors on an integrated circuit chip would double every 18 months.
This is “Moore’s Law.”

Data on the dates and number of transistors for Intel microprocessors

Processor     Date   Transistors
4004          1971   2,250
8008          1972   2,500
8080          1974   5,000
8086          1978   29,000
286           1982   120,000
386           1985   275,000
486 DX        1989   1,180,000
Pentium       1993   3,100,000
Pentium II    1997   7,500,000
Pentium III   1999   24,000,000
Pentium 4     2000   42,000,000

Scatterplot showing growth in the number of transistors on a computer chip from 1971 to 2000. Is this exponential growth?

The Logarithm Transformation

Assignment: Worksheet on logarithms.

Scatterplot of ln(number of transistors) versus years since 1970.

Plot of transformed transistor data with least-squares line:

\[ \ln(\text{transistors}) = 7.41 + 0.332 \times (\text{years since 1970}), \qquad r^2 = 99.5\% \]

Residual plot for the transformed data.

Predictions in the exponential growth model

To predict the number of transistors on Intel's Itanium 2 chip, which was released in 2003, we substitute 33 for “years since 1970” in the regression equation:

\[ \ln(\text{transistors}) = 7.41 + 0.332 \times 33 = 18.366 \]

Since ln is the logarithm with base e, this tells us that

\[ \text{transistors} = e^{18.366} \approx 9.47 \times 10^{7} \]

So our model predicts about 95 million transistors on the Itanium 2 chip. In fact, the chip had about 410 million transistors!

Growth of a bacteria population over a 24-hour period.

Transforming bacteria counts

Exact exponential growth.

Logarithms of the bacteria count.

Modeling exponential growth with the TI-83/84

• Enter the data into L1 and L2. Use ZoomStat to draw the scatterplot.
• Define L3 as the natural logarithm of L2. Then make a scatterplot of ln(L2) versus L1.
• Next, perform the least-squares regression on the transformed data. Here is the scatterplot with the LSRL.
• Despite the high \(r^2\) value, you should always inspect the residual plot. Be sure to plot the residuals (stored in the RESID list) versus L1.

Assignment: P. 203: 4.4

Power Law Models

1. The power model is \(y = a x^p\).
2. Take the logarithm of both sides of this equation.
You see that

\[ \log y = \log a + p \log x \]

That is, taking the logarithm of both variables results in a linear relationship between log x and log y.
3. Look carefully: the power p in the power law becomes the slope of the straight line that links log y to log x.

Predicting brain weight from body weight: Using a power model

\[ \log \hat{y} = 1.01 + 0.72 \log x \]
\[ \hat{y} = 10^{1.01 + 0.72 \log x} = 10^{1.01} \times 10^{0.72 \log x} = 10.2 \times \left(10^{\log x}\right)^{0.72} \]

Because \(10^{\log x} = x\), the estimated power model connecting predicted brain weight \(\hat{y}\) with body weight x for mammals is

\[ \hat{y} = 10.2\, x^{0.72} \]

For a body weight of x = 127:

\[ \hat{y} = 10.2\,(127)^{0.72} = 10.2\,(32.7) \approx 333.7 \text{ grams} \]

What's a planet anyway?

Planet     Distance from sun (astronomical units)   Period of revolution (Earth years)
Mercury     0.387                                     0.241
Venus       0.723                                     0.615
Earth       1.000                                     1.000
Mars        1.524                                     1.881
Jupiter     5.203                                    11.862
Saturn      9.539                                    29.456
Uranus     19.191                                    84.070
Neptune    30.061                                   164.810
Pluto      39.529                                   248.530

Scatterplot of planetary distance from the sun and period of revolution.

The scatterplot of ln(period) versus distance is not linear.

The scatterplot of ln(period) versus ln(distance) appears very linear.

Plot of residuals versus ln(distance).

The last step is to perform an inverse transformation on the linear regression equation:

\[ \ln(\text{period}) = 0.000254 + 1.50 \ln(\text{distance}) \]
\[ e^{\ln(\text{period})} = e^{0.000254 + 1.50 \ln(\text{distance})} \]
\[ \text{period} = e^{0.000254} \times e^{1.50 \ln(\text{distance})} \]
\[ \text{period} = 1.000 \times \left(e^{\ln(\text{distance})}\right)^{1.50} \]
\[ \text{period} = 1.000 \times \text{distance}^{1.50} \]

Planetary data with power model.

Power law modeling

• Enter the x data (explanatory) into L1 and the y data into L2.
• Produce a scatterplot of y versus x. Confirm a nonlinear trend that could be modeled by a power function of the form \(y = a x^p\).
• Define L3 to be the logarithm of L1, and define L4 to be the logarithm of L2.
• Plot L4 versus L3. Verify that the pattern is approximately linear.
• Calculate the regression equation for the transformed data and store it in Y1. Check the \(r^2\) value.

Power law modeling, continued

• Construct a residual plot, in the form of either RESID versus x or RESID versus predicted values.
Ideally, the points in a residual plot should be randomly scattered above and below the y = 0 reference line.
• Perform the inverse transformation to find the power function \(y = a x^p\) that models the original data. Define Y2 to be \((10^a)(x^b)\) or \((e^a)(x^b)\), depending on the type of logarithm you used for the transformation. The calculator has stored the values of a and b for the most recent regression performed. Deselect Y1 and plot Y2 and the scatterplot for the original data together.
• To make a prediction for the value x = k, evaluate Y2(k) on the home screen.

Assignment 4.1: P. 212: 4.6, 4.10, 4.11

4.2 Relationships between Categorical Variables (Objectives)

• Explain what is meant by a two-way table.
• Explain what is meant by marginal distributions in a two-way table.
• Describe how changing counts to percents is helpful in describing relationships between categorical variables.
• Explain what is meant by a conditional distribution.
• Define Simpson's paradox, and give an example of it.

Organizing categorical variables

This is a two-way table because it describes two categorical variables. Age group is the row variable because each row in the table describes students in one age group. Sex is the column variable because each column describes one sex.

                      Sex
Age group (years)   Female    Male     Total
15-17                   89      61       150
18-24                5,668   4,697    10,365
25-34                1,904   1,589     3,494
35+                  1,660     970     2,630
Total                9,321   7,317    16,639

Marginal Distributions

The marginal distributions are the distributions that appear at the bottom margin (sex alone) and the right margin (age alone) of a two-way table such as the one above.

Calculating a marginal distribution

The percent of college students who are 18 to 24 years old is

\[ \frac{\text{total aged 18-24}}{\text{table total}} = \frac{10{,}365}{16{,}639} = 0.623 = 62.3\% \]

Age group   15-17   18-24   25-34   35+
Percent       0.9    62.3    21.0   15.8

The total is 100% because everyone is in one of the four categories.

A bar graph of the distribution of age for college students.
This is one of the marginal distributions for the previous table.

Describing Relationships

To describe relationships among categorical variables, calculate appropriate percents from the counts given. When we compare the percents of women in two age groups, we are comparing two conditional distributions.

Conditional distribution of sex given age

                     Female   Male
Sex (ages 18-24)       54.7   45.3
Sex (ages 35+)         63.1   36.9

Bar graph comparing the percent of female college students in four age groups. There are more women than men in all age groups, but the percent of women is higher among older students.

Distribution of age given sex

Age group           15-17   18-24   25-34   35+
Percent of women      1.0    60.8    20.4   17.8
Percent of men        0.8    64.2    21.7   13.3

CrunchIt! output of the two-way table of college students by age and sex, along with each entry as a percent of its row total. The percents in each row give the conditional distribution of sex for one age group, and the percents in the “Total” row give the marginal distribution of sex for all college students.

Caution!

• No single graph (such as a scatterplot) portrays the form of the relationship between categorical variables.
• No single numerical measure (such as correlation) summarizes the strength of the association.

Assignment: P. 245: 4.53, 4.55

Smiling Faces

• Women smile more than men.
• Women smile more when they think they are being watched. Men don't.
• Within professions, there is no difference between how much women and men smile.

Do medical helicopters save lives?

                   Helicopter    Road
Victim died                64     260
Victim survived           136     840
Total                     200   1,100

Serious accidents
                   Helicopter    Road
Died                       48      60
Survived                   52      40
Total                     100     100

Less serious accidents
                   Helicopter    Road
Died                       16     200
Survived                   84     800
Total                     100   1,000

Assignment: P. 212: 4.6, 4.10, 4.11; P. 250-252: 4.59, 4.60; P. 255: 4.70

4.3 Establishing Causation

4.3 Objectives

• Identify the three ways in which the association between two variables can be explained.
• Explain what process provides the best evidence for causation.
• Define what is meant by a common response.
• Discuss why establishing a cause-and-effect relationship through experimentation is not always possible.
• Define what it means to say that two variables are confounded.
• Explain what it means to say that a lack of evidence for a cause-and-effect relationship does not necessarily mean that there is no cause-and-effect relationship.
• List five criteria for establishing causation when you cannot conduct a controlled experiment.

Six interesting relationships
Examining association

1) X = mother's BMI; Y = daughter's BMI
2) X = amount of saccharin in a rat's diet; Y = count of tumors in the rat's body
3) X = a high school senior's SAT score; Y = the student's first-year college GPA
4) X = monthly flow of money into stock market funds; Y = monthly rate of return for the stock market
5) X = whether a person regularly attends religious services; Y = how long the person lives
6) X = the number of years of education a worker has; Y = the worker's income

Variables x and y show a strong association (dashed line). The association may be the result of any of several causal relationships (solid arrows):
(a) Changes in x cause changes in y.
(b) Changes in both x and y are caused by a lurking variable z.
(c) The effect (if any) of x on y is confounded with the effect of a lurking variable z.

BMI in mothers and daughters; saccharin in rats

Mothers' and daughters' BMI? What about heredity? Diet and exercise habits? Even when direct causation is present, it is rarely a complete explanation of an association between two variables.

Rats? Do results with rats translate to people? Even well-established causal relationships may not generalize to other settings.

Explaining Association: Common Response

“Beware the lurking variable.”

When both x and y change in response to changes in a lurking variable z, this is called a common response.

1. X = a high school senior's SAT score; Y = the student's first-year college GPA
   Both results can be explained by a common response to ability and knowledge.
2. X = monthly flow of money into stock market funds; Y = monthly rate of return for the stock market
   Both the market and individual investors respond to positive feelings about the market.

Explaining Association: Confounding

Confounding: Religion and life span; education and income

1) X = whether a person regularly attends religious services; Y = how long the person lives
   People who go to church are also less likely to smoke, more likely to exercise, and less likely to be overweight.
2) X = the number of years of education a worker has; Y = the worker's income
   • People with high ability are more likely to come from prosperous homes and therefore have more education.
   • Being able and rich leads to both higher education and higher income.

Establishing Causation

The best way to establish causation is through a controlled experiment, but this is not always possible.

Do power lines increase the risk of leukemia? A careful study shows no evidence of more than a chance connection between magnetic fields and childhood leukemia.

Does smoking cause lung cancer?

Criteria for establishing causation without an experiment

• The association is strong.
• The association is consistent.
• Larger values of the explanatory variable (higher doses) are associated with stronger responses.
• The alleged cause precedes the effect in time.
• The alleged cause is plausible.

Assignment 4.3: P. 237: 4.37; P. 238-240: 4.40, 4.43, 4.49

1. Case Closed! Determining Insurance Premiums

a) Would a power model provide an appropriate description of the relationship between age and monthly premiums? Transform the data and sketch a graph that will help answer this question.
Answer: Let y = premium and x = age.
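With x and y defined this way, the two candidate transformations can be checked numerically as well as graphically. The following is a minimal sketch in plain Python, not the slides' calculator procedure: the helper name `least_squares` is hypothetical, and base-10 logarithms are assumed to match the regression equation quoted later in this case study. It fits a line to (log x, log y) for the power model and to (x, log y) for the exponential model and reports r^2 for each.

```python
import math

# Case-study data from the text: age in years (x), monthly premium in dollars (y).
ages = [40, 45, 50, 55, 60, 65]
premiums = [29, 46, 68, 106, 157, 257]


def least_squares(xs, ys):
    """Hypothetical helper: return (intercept, slope, r_squared)
    for the least-squares regression line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Base-10 logarithms of both variables (an assumption about the log base).
log_age = [math.log10(x) for x in ages]
log_premium = [math.log10(y) for y in premiums]

# Power model check: regress log(premium) on log(age).
power_fit = least_squares(log_age, log_premium)

# Exponential model check: regress log(premium) on age itself.
exp_fit = least_squares(ages, log_premium)

print("power model (log-log):   r^2 = %.4f" % power_fit[2])
print("exponential (semi-log):  r^2 = %.4f" % exp_fit[2])
print("exponential fit: log y = %.4f + %.4f x" % (exp_fit[0], exp_fit[1]))
```

A high r^2 alone does not settle the choice between the two models; the residual plot for each transformed fit still needs to be inspected.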
Scatterplots of the original data and of the transformed data (after taking the logarithms of both variables) show that the original data have a strong nonlinear relationship and the transformed data show a clear linear trend, so the power model is appropriate.

Age    Monthly Premium
40     $29
45     $46
50     $68
55     $106
60     $157
65     $257

b) Would an exponential model provide an appropriate description of the relationship between age and monthly premium? Transform the data and sketch a graph that will help answer this question.
Answer: The linear trend in the scatterplot of the logarithm of premium versus age suggests that the exponential model is appropriate.

c) Based on your answers to (a) and (b), use least-squares regression to fit the most appropriate type of model for these data. Perform the inverse transformation to write monthly premium as a function of age.
Answer: The LSRL for the transformed data is \(\log \hat{y} = -0.0275 + 0.0373x\). Using the inverse transformation, the predicted premium is

\[ \hat{y} = 10^{-0.0275} \times 10^{0.0373x} = 0.9386 \times 10^{0.0373x} \]

d) Use your model to predict the monthly premium for a 58-year-old man who wants a $1 million, 10-year term life insurance policy. Do the same for a 68-year-old.
Answer: $136.74 and $322.76.

e) How comfortable do you feel about the predictions in (d)? Justify your answer using a residual plot and \(r^2\).
Answer: You should feel very comfortable with these predictions. The residual plot shows no clear pattern and \(r^2 = 99.9\%\), so the exponential model provides an excellent fit.

2. Death statistics

Deaths in the US from selected causes in 2003

                   15-24     25-44     45-64
Accidents         14,966    27,844    23,699
AIDS                 171     6,879     5,917
Cancer             1,628    19,041   144,936
Heart disease      1,083    16,283   101,713
Homicide           5,148     7,367     2,756
Suicide            3,921    11,251    10,057
Total deaths      33,022   128,924   437,058

a) Why don't the entries in each column add to the “total deaths” count?
Answer: The entries in each column cover only these six selected causes of death.

b) Should you use counts or percents to compare the age groups?
Answer: Percents should be used because the age groups contain different numbers of deaths.

c) Construct a conditional distribution of cause of death for each age group. Then make a bar graph to display the results.
Answer:

                   15-24     25-44     45-64
Accidents         45.32%    21.60%     5.42%
AIDS               0.52%     5.34%     1.35%
Cancer             4.93%    14.77%    33.16%
Heart disease      3.28%    12.63%    23.27%
Homicide          15.59%     5.71%     0.63%
Suicide           11.87%     8.73%     2.30%

d) Explain how the leading cause of death changes as people age.

3. Stay Fitter, Live Longer

A sign at a fitness center says “mortality is halved for men over 65 who walk at least two miles a day.”

a) Mortality is eventually 100% for everyone. What do you think “mortality is halved” means?
Answer: The chance of dying for men over 65 who walk at least 2 miles a day is half that of men who do not exercise.

b) Assuming that the claim is true, explain why this fact does not prove that walking causes lower mortality.
Answer: Individuals who exercise regularly often have other habits that could contribute to longer lives.

What you should have learned

A. Modeling Nonlinear Data
1. Use powers to transform nonlinear data to achieve linearity.
2. Recognize that, when a variable is multiplied by a fixed number in each equal time period, exponential growth results and that, when one variable is proportional to a power of a second variable, a power law model results.
3. In the case of both exponential growth and power functions, perform a logarithmic transformation and obtain points that lie in a linear pattern. Then use least-squares regression on the transformed data. Carry out an inverse transformation to produce a curve that models the original data.
4. Know that deviations from the overall pattern are most easily examined by fitting a line to the transformed points and plotting the residuals from this line against the explanatory variable.

B. Relations in Categorical Data
1. From a two-way table of counts, find the marginal distributions of both variables by obtaining the row sums and column sums.
2. Describe the relationship between two categorical variables by computing and comparing percents. Often this involves comparing the conditional distributions of one variable for the different categories of the other variable. Construct bar graphs when appropriate.
3. Recognize Simpson's paradox and be able to explain it.

C. Establishing Causation
1. Recognize possible lurking variables that may help explain the observed association between two variables x and y.
2. Determine whether the relationship between two variables is most likely due to causation, common response, or confounding.
3. Understand that even a strong association does not mean that there is a cause-and-effect relationship between x and y.

Chapter Review Assignment: P. 257-261: 4.72, 4.75, 4.77, 4.80, 4.83

Monotonic Functions

A monotonic function f(t) moves in one direction as its argument t increases.

A monotonic increasing function preserves the order of data. That is, if a > b, then f(a) > f(b).

A monotonic decreasing function reverses the order of data. That is, if a > b, then f(a) < f(b).
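
The two definitions above can be checked directly on a small data set. This is a minimal sketch in plain Python; the helper names `is_order_preserved` and `is_order_reversed` are hypothetical, and the data values are the monthly premiums from the case study, used here only as convenient positive numbers. It confirms that the logarithm keeps the order of the data, that 1/sqrt(t) flips it, and that squaring is not monotonic once the data include both negative and positive values.

```python
import math


def is_order_preserved(values, f):
    """Hypothetical helper: True if, for every pair of distinct values,
    a > b exactly when f(a) > f(b)."""
    return all(
        (f(a) > f(b)) == (a > b)
        for a in values for b in values if a != b
    )


def is_order_reversed(values, f):
    """Hypothetical helper: True if, for every pair of distinct values,
    a > b exactly when f(a) < f(b)."""
    return all(
        (f(a) < f(b)) == (a > b)
        for a in values for b in values if a != b
    )


# Positive illustrative data (monthly premiums from the case study).
data = [29, 46, 68, 106, 157, 257]

# log t is monotonic increasing for t > 0, so order is preserved.
log_keeps_order = is_order_preserved(data, math.log10)

# 1/sqrt(t) is monotonic decreasing for t > 0, so order is reversed.
reciprocal_root_flips_order = is_order_reversed(data, lambda t: 1 / math.sqrt(t))

# t^2 is not monotonic when the data contain both signs: -3 < -1 but 9 > 1.
square_keeps_order = is_order_preserved([-3, -1, 2], lambda t: t ** 2)

print(log_keeps_order, reciprocal_root_flips_order, square_keeps_order)
# prints: True True False
```

The last check is a reminder that a power transformation behaves monotonically only on the range where the data live, which is why these transformations are normally applied to positive values.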